<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  
  
  <title>Hexo</title>
  <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
  <meta property="og:type" content="website">
<meta property="og:title" content="Hexo">
<meta property="og:url" content="https://chudod.gitee.io/myworks/page/2/index.html">
<meta property="og:site_name" content="Hexo">
<meta property="og:locale" content="en_US">
<meta property="article:author" content="John Doe">
<meta name="twitter:card" content="summary">
  
    <link rel="alternate" href="/myworks/atom.xml" title="Hexo" type="application/atom+xml">
  
  
    <link rel="shortcut icon" href="/myworks/favicon.png">
  
  
    
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/typeface-source-code-pro@0.0.71/index.min.css">

  
  
<link rel="stylesheet" href="/myworks/css/style.css">

  
    
<link rel="stylesheet" href="/myworks/fancybox/jquery.fancybox.min.css">

  
<meta name="generator" content="Hexo 6.2.0"></head>

<body>
  <div id="container">
    <div id="wrap">
      <header id="header">
  <div id="banner"></div>
  <div id="header-outer" class="outer">
    <div id="header-title" class="inner">
      <h1 id="logo-wrap">
        <a href="/myworks/" id="logo">Hexo</a>
      </h1>
      
    </div>
    <div id="header-inner" class="inner">
      <nav id="main-nav">
        <a id="main-nav-toggle" class="nav-icon"></a>
        
          <a class="main-nav-link" href="/myworks/">Home</a>
        
          <a class="main-nav-link" href="/myworks/archives">Archives</a>
        
      </nav>
      <nav id="sub-nav">
        
          <a id="nav-rss-link" class="nav-icon" href="/myworks/atom.xml" title="RSS Feed"></a>
        
        <a id="nav-search-btn" class="nav-icon" title="Search"></a>
      </nav>
      <div id="search-form-wrap">
        <form action="//google.com/search" method="get" accept-charset="UTF-8" class="search-form"><input type="search" name="q" class="search-form-input" placeholder="Search"><button type="submit" class="search-form-submit">&#xF002;</button><input type="hidden" name="sitesearch" value="https://chudod.gitee.io/myworks"></form>
      </div>
    </div>
  </div>
</header>

      <div class="outer">
        <section id="main">
  
    <article id="post-OpenMPI&amp;MPICH_Allreduce" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
  <div class="article-meta">
    <a href="/myworks/2022/04/13/OpenMPI&MPICH_Allreduce/" class="article-date">
  <time class="dt-published" datetime="2022-04-12T16:00:00.000Z" itemprop="datePublished">2022-04-13</time>
</a>
    
  <div class="article-category">
    <a class="article-category-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/">Parallel Computing</a>►<a class="article-category-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/MPI%E9%9B%86%E5%90%88%E9%80%9A%E4%BF%A1%E7%AE%97%E6%B3%95/">MPI Collective Communication Algorithms</a>
  </div>

  </div>
  <div class="article-inner">
    
    
      <header class="article-header">
        
  
    <h1 itemprop="name">
      <a class="p-name article-title" href="/myworks/2022/04/13/OpenMPI&MPICH_Allreduce/">MPI_Allreduce Implementations in OpenMPI and MPICH</a>
    </h1>
  

      </header>
    
    <div class="e-content article-entry" itemprop="articleBody">
      
        <h1 id="Allreduce算法列表"><a href="#Allreduce算法列表" class="headerlink" title="List of Allreduce algorithms"></a>List of Allreduce algorithms</h1><h2 id="1、OpenMPI-4-1中的MPI-Allreduce算法"><a href="#1、OpenMPI-4-1中的MPI-Allreduce算法" class="headerlink" title="1. The MPI_Allreduce algorithms in OpenMPI 4.1"></a>1. The MPI_Allreduce algorithms in OpenMPI 4.1</h2><p>$\verb+OpenMPI-4.1.2+$ is the latest release of $\verb+OpenMPI+$. The algorithm selection logic lives in $\verb+ompi&#x2F;mca&#x2F;coll&#x2F;tuned&#x2F;coll_tuned_decision_fixed.c+$ and $\verb+ompi&#x2F;mca&#x2F;coll&#x2F;tuned&#x2F;coll_tuned_decision_dynamic.c+$; users can supply their own selection rules or pick an algorithm explicitly. $\verb+OpenMPI+$ implements six Allreduce algorithms:</p>
<figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">Algorithms:</span><br><span class="line">   &#123;<span class="number">1</span>, <span class="string">&quot;basic_linear&quot;</span>&#125;: Reduce + Broadcast</span><br><span class="line">   &#123;<span class="number">2</span>, <span class="string">&quot;nonoverlapping&quot;</span>&#125;: Reduce + Broadcast</span><br><span class="line">   &#123;<span class="number">3</span>, <span class="string">&quot;recursive_doubling&quot;</span>&#125;: Recursive Doubling</span><br><span class="line">   &#123;<span class="number">4</span>, <span class="string">&quot;ring&quot;</span>&#125;: <span class="built_in">Ring</span>(Segmented Messages) + <span class="built_in">Allgather</span>(Ring)</span><br><span class="line">   &#123;<span class="number">5</span>, <span class="string">&quot;segmented_ring&quot;</span>&#125;: Segmented Ring</span><br><span class="line">   &#123;<span class="number">6</span>, <span class="string">&quot;rabenseifner&quot;</span>&#125;: Reduce-Scatter + Allgather</span><br><span class="line">   <span class="comment">/* Currently, ring, segmented ring, and rabenseifner do not support non-commutative operations. */</span></span><br></pre></td></tr></table></figure>
<p>By default the rules in $\verb+&#x2F;coll_tuned_decision_fixed.c+$ apply; the selection logic is summarized below (the original code is a 100-plus-line $else-if$ chain, quite brute-force):<br><img src="https://note.youdao.com/yws/api/personal/file/WEBef75c5797a8fe9a7d3afe7d9f84db3a2?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13"><br><img src="https://note.youdao.com/yws/api/personal/file/WEBe38b3f100dd6599ec413faa1ee25edcf?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13"><br>Besides the default rules, users can also set MCA parameters to select a specific algorithm.</p>
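<p>For example, a particular algorithm can be forced from the command line through the tuned component's MCA parameters (a sketch: the parameter names are those of OpenMPI 4.1's $\verb+coll&#x2F;tuned+$ component, and $\verb+my_mpi_program+$ is a placeholder):</p>

```shell
# Force the ring algorithm (id 4 in the list above) for MPI_Allreduce.
# Dynamic rules must be enabled for the per-collective override to apply.
mpirun -np 8 \
    --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_allreduce_algorithm 4 \
    ./my_mpi_program
```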
<h2 id="2、MPICH中的MPI-Allreduce算法"><a href="#2、MPICH中的MPI-Allreduce算法" class="headerlink" title="2. The MPI_Allreduce algorithms in MPICH"></a>2. The MPI_Allreduce algorithms in MPICH</h2><p>$\verb+MPICH-4.0.2+$ is the latest release of $\verb+MPICH+$. $\verb+MPICH+$ calls the $\verb+MPIR_Csel_search+$ function to choose the implementation; its return value determines which algorithm is used. The selection logic sits in the $\verb+MPIR_Allreduce_allcomm_auto+$ function at line 4496 of $\verb+mpich-4.0.2&#x2F;src&#x2F;mpi&#x2F;coll&#x2F;mpir_coll.c+$. The five algorithms in $\verb+MPICH+$ are:</p>
<figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">Algorithms:</span><br><span class="line">    <span class="number">1</span>-&gt; <span class="string">&quot;MPIR_Allreduce_intra_smp&quot;</span>: Reduce + Broadcast;</span><br><span class="line">    <span class="number">2</span>-&gt; <span class="string">&quot;MPIR_Allreduce_intra_recursive_doubling&quot;</span>: Recursive Doubling;</span><br><span class="line">    <span class="number">3</span>-&gt; <span class="string">&quot;MPIR_Allreduce_intra_reduce_scatter_allgather&quot;</span>: Reduce-Scatter + Allgather;</span><br><span class="line">    <span class="number">4</span>-&gt; <span class="string">&quot;MPIR_Allreduce_inter_reduce_exchange_bcast&quot;</span>: a variant of Reduce + Broadcast;</span><br><span class="line">    <span class="number">5</span>-&gt; <span class="string">&quot;MPIR_Allreduce_allcomm_nb&quot;</span>: Nonblocking algorithm, namely MPIR_Iallreduce;</span><br></pre></td></tr></table></figure>
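<p>In MPICH the automatic choice can likewise be overridden per collective through a CVAR environment variable (a sketch; the variable name follows MPICH 4.x's $\verb+MPIR_CVAR+$ naming, and $\verb+my_mpi_program+$ is a placeholder):</p>

```shell
# Ask MPICH to use recursive doubling for intracommunicator Allreduce
# instead of letting MPIR_Csel_search decide.
export MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=recursive_doubling
mpiexec -np 8 ./my_mpi_program
```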

<h1 id="Allreduce算法实现"><a href="#Allreduce算法实现" class="headerlink" title="Allreduce algorithm implementations"></a>Allreduce algorithm implementations</h1><h2 id="1、-Reduce-Broadcast-算法"><a href="#1、-Reduce-Broadcast-算法" class="headerlink" title="1. The $Reduce$+$Broadcast$ algorithm"></a>1. The $Reduce$+$Broadcast$ algorithm</h2><h3 id="OpenMPI版"><a href="#OpenMPI版" class="headerlink" title="OpenMPI version"></a>OpenMPI version</h3><h4 id="basic-linear函数"><a href="#basic-linear函数" class="headerlink" title="The basic_linear function"></a>The basic_linear function</h4><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="type">int</span> <span class="title">ompi_coll_base_allreduce_intra_basic_linear</span><span class="params">(<span class="type">const</span> <span class="type">void</span> *sbuf, <span class="type">void</span> *rbuf, <span class="type">int</span> count, </span></span></span><br><span class="line"><span class="params"><span class="function">                                                <span class="keyword">struct</span> <span class="type">ompi_datatype_t</span> *dtype, </span></span></span><br><span class="line"><span class="params"><span class="function">                                                <span class="keyword">struct</span> <span class="type">ompi_op_t</span> *op, </span></span></span><br><span class="line"><span class="params"><span class="function">                                                <span class="keyword">struct</span> <span class="type">ompi_communicator_t</span> *comm,</span></span></span><br><span class="line"><span class="params"><span class="function">                                                <span class="type">mca_coll_base_module_t</span> *<span class="keyword">module</span>)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    <span class="comment">/* Reduce to 0 and broadcast. */</span></span><br><span class="line">    <span class="keyword">if</span> (MPI_IN_PLACE == sbuf) &#123;</span><br><span class="line">        <span class="keyword">if</span>(<span class="number">0</span> == rank)  </span><br><span class="line">            err = <span class="built_in">ompi_coll_base_reduce_intra_basic_linear</span> (MPI_IN_PLACE, rbuf, count, dtype, op, <span class="number">0</span>, comm, <span class="keyword">module</span>);</span><br><span class="line">        <span class="keyword">else</span> </span><br><span class="line">            err = <span class="built_in">ompi_coll_base_reduce_intra_basic_linear</span>(rbuf, <span class="literal">NULL</span>, ...);</span><br><span class="line">    &#125; </span><br><span class="line">    <span class="keyword">else</span> &#123;</span><br><span class="line">        err = <span class="built_in">ompi_coll_base_reduce_intra_basic_linear</span>(sbuf, rbuf, ...);</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">if</span> (MPI_SUCCESS != err) <span class="keyword">return</span> err;</span><br><span class="line">    <span class="keyword">return</span> <span class="built_in">ompi_coll_base_bcast_intra_basic_linear</span>(rbuf, count, dtype, <span class="number">0</span>, comm, <span class="keyword">module</span>);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>The $\verb+MPI_IN_PLACE+$ argument is used in functions such as $\verb+MPI_GATHER+$ and $\verb+MPI_Reduce+$ that take both a $\verb+send_buf+$ and a $\verb+recv_buf+$. Passed in place of $\verb+send_buf+$, it indicates that the process both sends and receives data, and that the data to be sent lives in the same memory as the data to be received.</p>
<h4 id="nonoverlapping函数"><a href="#nonoverlapping函数" class="headerlink" title="The nonoverlapping function"></a>The nonoverlapping function</h4><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="type">int</span> <span class="title">ompi_coll_base_allreduce_intra_nonoverlapping</span><span class="params">(<span class="type">const</span> <span class="type">void</span> *sbuf, <span class="type">void</span> *rbuf, <span class="type">int</span> count, </span></span></span><br><span class="line"><span class="params"><span class="function">                                                <span class="keyword">struct</span> <span class="type">ompi_datatype_t</span> *dtype, </span></span></span><br><span class="line"><span class="params"><span class="function">                                                <span class="keyword">struct</span> <span class="type">ompi_op_t</span> *op, </span></span></span><br><span class="line"><span class="params"><span class="function">                                                <span class="keyword">struct</span> <span class="type">ompi_communicator_t</span> *comm,</span></span></span><br><span class="line"><span class="params"><span class="function">                                                <span class="type">mca_coll_base_module_t</span> *<span class="keyword">module</span>)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    rank = <span class="built_in">ompi_comm_rank</span>(comm);</span><br><span class="line"></span><br><span class="line">    <span class="comment">// the first argument is sbuf, or MPI_IN_PLACE</span></span><br><span class="line">    err = comm-&gt;c_coll-&gt;<span class="built_in">coll_reduce</span> (sbuf, rbuf, count, dtype, op, <span class="number">0</span>, comm, comm-&gt;c_coll-&gt;coll_reduce_module);</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (MPI_SUCCESS != err) <span class="keyword">return</span> err;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> comm-&gt;c_coll-&gt;<span class="built_in">coll_bcast</span>(rbuf, count, dtype, <span class="number">0</span>, comm, comm-&gt;c_coll-&gt;coll_bcast_module);</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<h3 id="MPICH版"><a href="#MPICH版" class="headerlink" title="MPICH version"></a>MPICH version</h3><h2 id="2、-Recursive-Doubling-算法"><a href="#2、-Recursive-Doubling-算法" class="headerlink" title="2. The $Recursive$ $Doubling$ algorithm"></a>2. The $Recursive$ $Doubling$ algorithm</h2><h3 id="OpenMPI版-1"><a href="#OpenMPI版-1" class="headerlink" title="OpenMPI version"></a>OpenMPI version</h3><h4 id="recursivedoubling函数"><a href="#recursivedoubling函数" class="headerlink" title="The recursivedoubling function"></a>The recursivedoubling function</h4><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span 
class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span 
class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/*         The algorithm preserves the order of the reduction operation, so it supports both commutative and non-commutative operations</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> *         Example on 7 nodes:</span></span><br><span class="line"><span class="comment"> *         Initial state</span></span><br><span class="line"><span class="comment"> *         #      0       1      2       3      4       5      6</span></span><br><span class="line"><span class="comment"> *               [0]     [1]    [2]     [3]    [4]     [5]    [6]</span></span><br><span class="line"><span class="comment"> *         Initial adjustment step for non-power of two nodes.</span></span><br><span class="line"><span class="comment"> *         old rank      1              3              5      6</span></span><br><span class="line"><span class="comment"> *         new rank      0              1              2      3</span></span><br><span class="line"><span class="comment"> *                     [0+1]          [2+3]          [4+5]   [6]</span></span><br><span class="line"><span class="comment"> *         Step 1</span></span><br><span class="line"><span class="comment"> *         old rank      1              3              5      6</span></span><br><span class="line"><span class="comment"> *         new rank      0              1              2      3</span></span><br><span class="line"><span class="comment"> *                     [0+1+]         [0+1+]         [4+5+]  [4+5+]</span></span><br><span class="line"><span class="comment"> *                     [2+3+]         [2+3+]         [6   ]  [6   ]</span></span><br><span class="line"><span class="comment"> *         Step 2</span></span><br><span class="line"><span class="comment"> *         old rank      1              3              5      6</span></span><br><span class="line"><span class="comment"> *         new rank      0              1              2      3</span></span><br><span class="line"><span class="comment"> *                     [0+1+]         [0+1+]         [0+1+]  [0+1+]</span></span><br><span class="line"><span class="comment"> *                     [2+3+]         [2+3+]         [2+3+]  [2+3+]</span></span><br><span class="line"><span class="comment"> *                     [4+5+]         [4+5+]         [4+5+]  [4+5+]</span></span><br><span class="line"><span class="comment"> *                     [6   ]         [6   ]         [6   ]  [6   ]</span></span><br><span class="line"><span class="comment"> *         Final adjustment step for non-power of two nodes</span></span><br><span class="line"><span class="comment"> *         #      0       1      2       3      4       5      6</span></span><br><span class="line"><span class="comment"> *              [0+1+] [0+1+] [0+1+]  [0+1+] [0+1+]  [0+1+] [0+1+]</span></span><br><span class="line"><span class="comment"> *              [2+3+] [2+3+] [2+3+]  [2+3+] [2+3+]  [2+3+] [2+3+]</span></span><br><span class="line"><span class="comment"> *              [4+5+] [4+5+] [4+5+]  [4+5+] [4+5+]  [4+5+] [4+5+]</span></span><br><span class="line"><span class="comment"> *              [6   ] [6   ] [6   ]  [6   ] [6   ]  [6   ] [6   ]</span></span><br><span class="line"><span class="comment"> */</span></span><br><span class="line"><span class="function"><span class="type">int</span> <span class="title">ompi_coll_base_allreduce_intra_recursivedoubling</span><span class="params">(<span class="type">const</span> <span class="type">void</span> *sbuf, <span class="type">void</span> *rbuf, <span class="type">int</span> count, </span></span></span><br><span class="line"><span class="params"><span class="function">                                                    <span class="keyword">struct</span> <span class="type">ompi_datatype_t</span> *dtype, </span></span></span><br><span class="line"><span class="params"><span class="function">                                                    <span class="keyword">struct</span> <span class="type">ompi_op_t</span> *op, </span></span></span><br><span class="line"><span class="params"><span class="function">                                                    <span class="keyword">struct</span> <span class="type">ompi_communicator_t</span> *comm,</span></span></span><br><span class="line"><span class="params"><span class="function">                                                    <span class="type">mca_coll_base_module_t</span> *<span class="keyword">module</span>)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    <span class="type">int</span> size = <span class="built_in">ompi_comm_size</span>(comm);</span><br><span class="line">    <span class="type">int</span> rank = <span class="built_in">ompi_comm_rank</span>(comm);</span><br><span class="line">    </span><br><span class="line">    <span class="keyword">if</span> (<span class="number">1</span> == size) &#123;</span><br><span class="line">        <span class="keyword">if</span> (MPI_IN_PLACE != sbuf) <span class="comment">//copy data from sbuf to rbuf;</span></span><br><span class="line">        <span class="keyword">return</span> MPI_SUCCESS;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">/* Determine nearest power of two less than or equal to size */</span></span><br><span class="line">    adjsize = <span class="built_in">opal_next_poweroftwo</span> (size);</span><br><span class="line">    adjsize &gt;&gt;= <span class="number">1</span>;</span><br><span class="line"></span><br><span class="line">    <span class="comment">/* Handle non-power-of-two case:</span></span><br><span class="line"><span class="comment">       - Even ranks less than 2 * extra_ranks send their data to (rank + 1), and set new rank to -1.</span></span><br><span class="line"><span class="comment">       - Odd ranks less than 2 * extra_ranks receive data from (rank - 1), apply appropriate operation, and set new rank to rank/2</span></span><br><span class="line"><span class="comment">       - Everyone else sets rank to rank - extra_ranks</span></span><br><span class="line"><span class="comment">    */</span></span><br><span class="line"></span><br><span class="line">    extra_ranks = size - adjsize;</span><br><span class="line">    <span class="keyword">if</span> (rank &lt;  (<span class="number">2</span> * extra_ranks)) </span><br><span class="line">    &#123;</span><br><span class="line">        <span class="keyword">if</span> (<span class="number">0</span> == (rank % <span class="number">2</span>)) &#123;</span><br><span class="line">            <span class="built_in">MCA_PML_CALL</span>(<span class="built_in">send</span>(tmpsend, count, dtype, (rank + <span class="number">1</span>), ..., comm));</span><br><span class="line">            newrank = <span class="number">-1</span>;</span><br><span class="line">        &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">            <span class="built_in">MCA_PML_CALL</span>(<span class="built_in">recv</span>(tmprecv, count, dtype, (rank - <span class="number">1</span>), ..., comm, ...));</span><br><span class="line">            <span class="comment">/* tmpsend = tmprecv (op) tmpsend */</span></span><br><span class="line">            <span class="built_in">ompi_op_reduce</span>(op, tmprecv, tmpsend, count, dtype);</span><br><span class="line">            newrank = rank &gt;&gt; <span class="number">1</span>;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">else</span> &#123;newrank = rank - extra_ranks;&#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">/* Communication/Computation loop</span></span><br><span class="line"><span class="comment">       - Exchange message with remote node.</span></span><br><span class="line"><span class="comment">       - Perform appropriate operation taking into account order of operations:</span></span><br><span class="line"><span class="comment">       result = value (op) result</span></span><br><span class="line"><span class="comment">    */</span></span><br><span class="line">   <span class="keyword">for</span> (distance = <span class="number">0x1</span>; distance &lt; adjsize; distance &lt;&lt;=<span class="number">1</span>) &#123;</span><br><span class="line">        <span class="keyword">if</span> (newrank &lt; <span class="number">0</span>) <span class="keyword">break</span>;</span><br><span class="line">        <span class="comment">/* Determine remote node (XOR of newrank and distance) */</span></span><br><span class="line">        newremote = newrank ^ distance;</span><br><span class="line">        remote = (newremote &lt; extra_ranks)? (newremote * <span class="number">2</span> + <span class="number">1</span>):(newremote + extra_ranks);</span><br><span class="line"></span><br><span class="line">        <span class="comment">/* Exchange the data */</span></span><br><span class="line">        ret = <span class="built_in">ompi_coll_base_sendrecv_actual</span>(tmpsend, count, dtype, remote, ..., tmprecv, count, dtype, remote, ..., comm, ...);</span><br><span class="line"></span><br><span class="line">        <span class="comment">/* Apply operation */</span></span><br><span class="line">        <span class="keyword">if</span> (rank &lt; remote) &#123;</span><br><span class="line">            <span class="comment">/* tmprecv = tmpsend (op) tmprecv, ompi_op_reduce(op, source, target, ...)*/</span></span><br><span class="line">            <span class="built_in">ompi_op_reduce</span>(op, tmpsend, tmprecv, count, dtype);</span><br><span class="line">            tmpswap = tmprecv;</span><br><span class="line">            tmprecv = tmpsend;</span><br><span class="line">            tmpsend = tmpswap;</span><br><span class="line">        &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">            <span class="comment">/* tmpsend = tmprecv (op) tmpsend */</span></span><br><span class="line">            <span class="built_in">ompi_op_reduce</span>(op, tmprecv, tmpsend, count, dtype);</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">/* Handle non-power-of-two case:</span></span><br><span class="line"><span class="comment">       - Odd ranks less than 2 * extra_ranks send result from tmpsend to (rank - 1)</span></span><br><span class="line"><span class="comment">       - Even ranks less than 2 * extra_ranks receive result from (rank + 1)</span></span><br><span class="line"><span class="comment">    */</span></span><br><span class="line">    <span class="keyword">if</span> (rank &lt; (<span class="number">2</span> * extra_ranks)) &#123;</span><br><span class="line">        <span class="keyword">if</span> (<span class="number">0</span> == (rank % <span class="number">2</span>)) &#123;</span><br><span class="line">            <span class="built_in">MCA_PML_CALL</span>(<span class="built_in">recv</span>(rbuf, count, dtype, (rank + <span class="number">1</span>), ...));</span><br><span class="line">            tmpsend = (<span class="type">char</span>*)rbuf;</span><br><span class="line">        &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">            <span class="built_in">MCA_PML_CALL</span>(<span class="built_in">send</span>(tmpsend, count, dtype, (rank - <span class="number">1</span>), ...));</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">/* Ensure that the final result is in rbuf */</span></span><br><span class="line">    <span class="keyword">if</span> (tmpsend != rbuf) <span class="comment">//copy from tmpsend to rbuf;</span></span><br><span class="line"></span><br><span class="line">    <span class="keyword">return</span> MPI_SUCCESS;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>Note that the $\verb+recursive doubling+$ implementation in recent $OpenMPI$ preserves operand order, so it supports operations that are not commutative; the operation must still be associative.</p>
<h3 id="MPICH版-1"><a href="#MPICH版-1" class="headerlink" title="MPICH version"></a>MPICH version</h3><p>The implementation is almost identical, so it is omitted here.</p>
<h2 id="4、reduce-scatter算法"><a href="#4、reduce-scatter算法" class="headerlink" title="4. The reduce-scatter algorithm"></a>4. The reduce-scatter algorithm</h2><h3 id="OpenMPI版-2"><a href="#OpenMPI版-2" class="headerlink" title="OpenMPI version"></a>OpenMPI version</h3><figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span 
class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br><span class="line">101</span><br><span class="line">102</span><br><span class="line">103</span><br><span class="line">104</span><br><span class="line">105</span><br><span class="line">106</span><br><span class="line">107</span><br><span class="line">108</span><br><span class="line">109</span><br><span class="line">110</span><br><span class="line">111</span><br><span class="line">112</span><br><span class="line">113</span><br><span class="line">114</span><br><span class="line">115</span><br><span 
class="line">116</span><br><span class="line">117</span><br><span class="line">118</span><br><span class="line">119</span><br><span class="line">120</span><br><span class="line">121</span><br><span class="line">122</span><br><span class="line">123</span><br><span class="line">124</span><br><span class="line">125</span><br><span class="line">126</span><br><span class="line">127</span><br><span class="line">128</span><br><span class="line">129</span><br><span class="line">130</span><br><span class="line">131</span><br><span class="line">132</span><br><span class="line">133</span><br><span class="line">134</span><br><span class="line">135</span><br><span class="line">136</span><br><span class="line">137</span><br><span class="line">138</span><br><span class="line">139</span><br><span class="line">140</span><br><span class="line">141</span><br><span class="line">142</span><br><span class="line">143</span><br><span class="line">144</span><br><span class="line">145</span><br><span class="line">146</span><br><span class="line">147</span><br><span class="line">148</span><br><span class="line">149</span><br><span class="line">150</span><br><span class="line">151</span><br><span class="line">152</span><br><span class="line">153</span><br><span class="line">154</span><br><span class="line">155</span><br><span class="line">156</span><br><span class="line">157</span><br><span class="line">158</span><br><span class="line">159</span><br><span class="line">160</span><br><span class="line">161</span><br><span class="line">162</span><br><span class="line">163</span><br><span class="line">164</span><br><span class="line">165</span><br><span class="line">166</span><br><span class="line">167</span><br><span class="line">168</span><br><span class="line">169</span><br><span class="line">170</span><br><span class="line">171</span><br><span class="line">172</span><br><span class="line">173</span><br><span class="line">174</span><br><span class="line">175</span><br><span 
class="line">176</span><br><span class="line">177</span><br><span class="line">178</span><br><span class="line">179</span><br><span class="line">180</span><br><span class="line">181</span><br><span class="line">182</span><br><span class="line">183</span><br><span class="line">184</span><br><span class="line">185</span><br><span class="line">186</span><br><span class="line">187</span><br><span class="line">188</span><br><span class="line">189</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/* </span></span><br><span class="line"><span class="comment"> * This algorithm is a combination of a reduce-scatter implemented with</span></span><br><span class="line"><span class="comment"> * recursive vector halving and recursive distance doubling, followed either</span></span><br><span class="line"><span class="comment"> * by an allgather implemented with recursive doubling.</span></span><br><span class="line"><span class="comment"> * </span></span><br><span class="line"><span class="comment"> * Limitations:</span></span><br><span class="line"><span class="comment"> *   count &gt;= 2^&#123;\floor&#123;\log_2 p&#125;&#125;</span></span><br><span class="line"><span class="comment"> *   commutative operations only</span></span><br><span class="line"><span class="comment"> *   intra-communicators only</span></span><br><span class="line"><span class="comment">*/</span></span><br><span class="line"><span class="function"><span class="type">int</span> <span class="title">ompi_coll_base_allreduce_intra_redscat_allgather</span><span class="params">(<span class="type">const</span> <span class="type">void</span> *sbuf, </span></span></span><br><span class="line"><span class="params"><span class="function">    <span class="type">void</span> *rbuf, <span class="type">int</span> count, <span class="keyword">struct</span> <span class="type">ompi_datatype_t</span> *dtype,</span></span></span><br><span class="line"><span class="params"><span 
class="function">    <span class="keyword">struct</span> <span class="type">ompi_op_t</span> *op, <span class="keyword">struct</span> <span class="type">ompi_communicator_t</span> *comm,</span></span></span><br><span class="line"><span class="params"><span class="function">    <span class="type">mca_coll_base_module_t</span> *<span class="keyword">module</span>)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    <span class="type">int</span> comm_size = <span class="built_in">ompi_comm_size</span>(comm);</span><br><span class="line">    <span class="type">int</span> rank = <span class="built_in">ompi_comm_rank</span>(comm);</span><br><span class="line"></span><br><span class="line">    <span class="type">int</span> nsteps = <span class="built_in">opal_hibit</span>(comm_size, comm-&gt;c_cube_dim + <span class="number">1</span>);<span class="comment">/*ilog2(comm_size)*/</span></span><br><span class="line">    <span class="built_in">assert</span>(nsteps &gt;= <span class="number">0</span>);</span><br><span class="line">    <span class="type">int</span> nprocs_pof2 = <span class="number">1</span> &lt;&lt; nsteps;                   <span class="comment">/* flp2(comm_size) */</span></span><br><span class="line"></span><br><span class="line">    <span class="type">ptrdiff_t</span> lb, extent, dsize, gap = <span class="number">0</span>;</span><br><span class="line">    <span class="built_in">ompi_datatype_get_extent</span>(dtype, &amp;lb, &amp;extent);</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (count &lt; nprocs_pof2 || !<span class="built_in">ompi_op_is_commute</span>(op)) </span><br><span class="line">    &#123;</span><br><span class="line">        <span class="keyword">return</span> <span class="built_in">ompi_coll_base_allreduce_intra_basic_linear</span>(sbuf, rbuf, count, dtype, op, comm, <span class="keyword">module</span>);</span><br><span 
class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (sbuf != MPI_IN_PLACE) <span class="comment">//copy from sbuf to rbuf;</span></span><br><span class="line"></span><br><span class="line">    <span class="comment">/*</span></span><br><span class="line"><span class="comment">     * Step 1. Reduce the number of processes to the nearest lower power of two</span></span><br><span class="line"><span class="comment">     * p&#x27; = 2^&#123;\floor&#123;\log_2 p&#125;&#125; by removing r = p - p&#x27; processes.</span></span><br><span class="line"><span class="comment">     * 1. All the even ranks send the second half of  the input vector to their right neighbor (rank + 1),</span></span><br><span class="line"><span class="comment">     *    and all the odd ranks send the first half of the input vector to their left neighbor (rank - 1).</span></span><br><span class="line"><span class="comment">     * 2. All 2r processes compute the reduction on their half.</span></span><br><span class="line"><span class="comment">     * 3. 
The odd ranks then send the result to their left neighbors (the even ranks).</span></span><br><span class="line"><span class="comment">     */</span></span><br><span class="line">    <span class="type">int</span> vrank, step, wsize;</span><br><span class="line">    <span class="type">int</span> nprocs_rem = comm_size - nprocs_pof2;</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (rank &lt; <span class="number">2</span> * nprocs_rem)</span><br><span class="line">    &#123;</span><br><span class="line">        <span class="type">int</span> count_lhalf = count / <span class="number">2</span>;</span><br><span class="line">        <span class="type">int</span> count_rhalf = count - count_lhalf;</span><br><span class="line">        <span class="keyword">if</span> (rank % <span class="number">2</span> != <span class="number">0</span>)</span><br><span class="line">        &#123;</span><br><span class="line">            <span class="comment">/*</span></span><br><span class="line"><span class="comment">             * Odd process -- exchange with (rank - 1)</span></span><br><span class="line"><span class="comment">             * Send the left half of the input vector to the left neighbor,</span></span><br><span class="line"><span class="comment">             * Recv the right half of the input vector from the left neighbor</span></span><br><span class="line"><span class="comment">             */</span></span><br><span class="line">            <span class="built_in">ompi_coll_base_sendrecv</span>(rbuf, count_lhalf, dtype, rank - <span class="number">1</span>, ...,</span><br><span class="line">                            (<span class="type">char</span> *)tmp_buf + (<span class="type">ptrdiff_t</span>)count_lhalf * extent,</span><br><span class="line">                            count_rhalf, dtype, rank - <span class="number">1</span>, ..., rank);</span><br><span class="line"></span><br><span class="line">            <span 
class="comment">/* Reduce on the right half of the buffers (result in rbuf) */</span></span><br><span class="line">            <span class="built_in">ompi_op_reduce</span>(op, (<span class="type">char</span> *)tmp_buf + (<span class="type">ptrdiff_t</span>)count_lhalf * extent,</span><br><span class="line">                        (<span class="type">char</span> *)rbuf + count_lhalf * extent, count_rhalf, dtype);</span><br><span class="line"></span><br><span class="line">            <span class="comment">/* Send the right half to the left neighbor */</span></span><br><span class="line">            <span class="built_in">MCA_PML_CALL</span>(<span class="built_in">send</span>((<span class="type">char</span> *)rbuf + (<span class="type">ptrdiff_t</span>)count_lhalf * extent,</span><br><span class="line">                               count_rhalf, dtype, rank - <span class="number">1</span>, ..., comm));</span><br><span class="line"></span><br><span class="line">            <span class="comment">/* This process does not participate in the recursive doubling phase */</span></span><br><span class="line">            vrank = <span class="number">-1</span>;</span><br><span class="line">        &#125;</span><br><span class="line">        <span class="keyword">else</span></span><br><span class="line">        &#123;</span><br><span class="line">            <span class="comment">/*</span></span><br><span class="line"><span class="comment">             * Even process -- exchange with rank + 1</span></span><br><span class="line"><span class="comment">             * Send the right half of the input vector to the right neighbor,</span></span><br><span class="line"><span class="comment">             * Recv the left half of the input vector from the right neighbor</span></span><br><span class="line"><span class="comment">             */</span></span><br><span class="line">            <span class="built_in">ompi_coll_base_sendrecv</span>((<span class="type">char</span> *)rbuf + (<span
class="type">ptrdiff_t</span>)count_lhalf*extent,</span><br><span class="line">                            count_rhalf, dtype, rank + <span class="number">1</span>, ...,tmp_buf, </span><br><span class="line">                            count_lhalf, dtype, rank + <span class="number">1</span>, ..., comm, ..., rank);</span><br><span class="line"></span><br><span class="line">            <span class="comment">/* Reduce on the left half of the buffers (result in rbuf) */</span></span><br><span class="line">            <span class="built_in">ompi_op_reduce</span>(op, tmp_buf, rbuf, count_lhalf, dtype);</span><br><span class="line"></span><br><span class="line">            <span class="comment">/* Recv the right half from the right neighbor */</span></span><br><span class="line">            <span class="built_in">MCA_PML_CALL</span>(<span class="built_in">recv</span>((<span class="type">char</span> *)rbuf + (<span class="type">ptrdiff_t</span>)count_lhalf * extent,</span><br><span class="line">                            count_rhalf, dtype, rank + <span class="number">1</span>, ..., comm, ...));</span><br><span class="line"></span><br><span class="line">            vrank = rank / <span class="number">2</span>;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="keyword">else</span> </span><br><span class="line">    &#123; <span class="comment">/* rank &gt;= 2 * nprocs_rem */</span></span><br><span class="line">        vrank = rank - nprocs_rem;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">/*</span></span><br><span class="line"><span class="comment">     * Step 2. Reduce-scatter implemented with recursive vector halving and</span></span><br><span class="line"><span class="comment">     * recursive distance doubling. 
We now have new virtual ranks, with the partial result in rbuf.</span></span><br><span class="line"><span class="comment">     * At the end, each process has 1 / p&#x27; of the total reduction result.</span></span><br><span class="line"><span class="comment">     */</span></span><br><span class="line">    <span class="type">int</span>* rindex = <span class="built_in">malloc</span>(<span class="built_in">sizeof</span>(*rindex) * nsteps);</span><br><span class="line">    <span class="type">int</span>* sindex = <span class="built_in">malloc</span>(<span class="built_in">sizeof</span>(*sindex) * nsteps);</span><br><span class="line">    <span class="type">int</span>* rcount = <span class="built_in">malloc</span>(<span class="built_in">sizeof</span>(*rcount) * nsteps);</span><br><span class="line">    <span class="type">int</span>* scount = <span class="built_in">malloc</span>(<span class="built_in">sizeof</span>(*scount) * nsteps);</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (vrank != <span class="number">-1</span>)</span><br><span class="line">    &#123;</span><br><span class="line">        step = <span class="number">0</span>;</span><br><span class="line">        wsize = count;</span><br><span class="line">        sindex[<span class="number">0</span>] = rindex[<span class="number">0</span>] = <span class="number">0</span>;</span><br><span class="line">        <span class="keyword">for</span> (<span class="type">int</span> mask = <span class="number">1</span>; mask &lt; nprocs_pof2; mask &lt;&lt;= <span class="number">1</span>)</span><br><span class="line">        &#123;</span><br><span class="line">            <span class="comment">/* On each iteration: rindex[step] = sindex[step] -- beginning of the</span></span><br><span class="line"><span class="comment">            current window. 
Length of the current window is stored in wsize.*/</span></span><br><span class="line">            <span class="type">int</span> vdest = vrank ^ mask;</span><br><span class="line">            <span class="comment">/* Translate vdest virtual rank to real rank */</span></span><br><span class="line">            <span class="type">int</span> dest = (vdest &lt; nprocs_rem) ? vdest * <span class="number">2</span> : vdest + nprocs_rem;</span><br><span class="line">            <span class="keyword">if</span> (rank &lt; dest) &#123;</span><br><span class="line">                <span class="comment">/*</span></span><br><span class="line"><span class="comment">                 * Recv into the left half of the current window, send the right</span></span><br><span class="line"><span class="comment">                 * half of the window to the peer (perform reduce on the left</span></span><br><span class="line"><span class="comment">                 * half of the current window)</span></span><br><span class="line"><span class="comment">                 */</span></span><br><span class="line">                rcount[step] = wsize / <span class="number">2</span>;</span><br><span class="line">                scount[step] = wsize - rcount[step];</span><br><span class="line">                sindex[step] = rindex[step] + rcount[step];</span><br><span class="line">            &#125;<span class="keyword">else</span> &#123;</span><br><span class="line">                <span class="comment">/*</span></span><br><span class="line"><span class="comment">                 * Recv into the right half of the current window, send the left</span></span><br><span class="line"><span class="comment">                 * half of the window to the peer (perform reduce on the right</span></span><br><span class="line"><span class="comment">                 * half of the current window)</span></span><br><span class="line"><span class="comment">                 */</span></span><br><span class="line">            
    scount[step] = wsize / <span class="number">2</span>;</span><br><span class="line">                rcount[step] = wsize - scount[step];</span><br><span class="line">                rindex[step] = sindex[step] + scount[step];</span><br><span class="line">            &#125;</span><br><span class="line"></span><br><span class="line">            <span class="comment">/* Send part of data from the rbuf, recv into the tmp_buf */</span></span><br><span class="line">            <span class="built_in">ompi_coll_base_sendrecv</span>((<span class="type">char</span> *)rbuf + (<span class="type">ptrdiff_t</span>)sindex[step] * extent,</span><br><span class="line">                            scount[step], dtype, dest, ...,</span><br><span class="line">                            (<span class="type">char</span> *)tmp_buf + (<span class="type">ptrdiff_t</span>)rindex[step] * extent, </span><br><span class="line">                            rcount[step], dtype, dest, ..., comm, ..., rank);</span><br><span class="line"></span><br><span class="line">            <span class="comment">/* Local reduce: rbuf[] = tmp_buf[] &lt;op&gt; rbuf[] */</span></span><br><span class="line">            <span class="built_in">ompi_op_reduce</span>(op, (<span class="type">char</span>*)tmp_buf + (<span class="type">ptrdiff_t</span>)rindex[step] * extent,</span><br><span class="line">                           (<span class="type">char</span> *)rbuf + (<span class="type">ptrdiff_t</span>)rindex[step] * extent,</span><br><span class="line">                           rcount[step], dtype);</span><br><span class="line"></span><br><span class="line">            <span class="comment">/* Move the current window to the received message */</span></span><br><span class="line">            <span class="keyword">if</span> (step + <span class="number">1</span> &lt; nsteps) &#123;</span><br><span class="line">                rindex[step + <span class="number">1</span>] = rindex[step];</span><br><span class="line">   
             sindex[step + <span class="number">1</span>] = rindex[step];</span><br><span class="line">                wsize = rcount[step];</span><br><span class="line">                step++;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line"></span><br><span class="line">        <span class="comment">/*</span></span><br><span class="line"><span class="comment">        * Step 3. Allgather by the recursive doubling algorithm.</span></span><br><span class="line"><span class="comment">        * Each process has 1 / p&#x27; of the total reduction result:</span></span><br><span class="line"><span class="comment">        * rcount[nsteps - 1] elements in the rbuf[rindex[nsteps - 1], ...].</span></span><br><span class="line"><span class="comment">        * All exchanges are executed in reverse order relative to recursive doubling (previous step).</span></span><br><span class="line"><span class="comment">        */</span></span><br><span class="line">        step = nsteps - <span class="number">1</span>;</span><br><span class="line">        <span class="keyword">for</span> (<span class="type">int</span> mask = nprocs_pof2 &gt;&gt; <span class="number">1</span>; mask &gt; <span class="number">0</span>; mask &gt;&gt;= <span class="number">1</span>)</span><br><span class="line">        &#123;</span><br><span class="line">            <span class="type">int</span> vdest = vrank ^ mask;</span><br><span class="line">            <span class="comment">/* Translate vdest virtual rank to real rank */</span></span><br><span class="line">            <span class="type">int</span> dest = (vdest &lt; nprocs_rem) ? vdest * <span class="number">2</span> : vdest + nprocs_rem;</span><br><span class="line">            <span class="comment">/* Send rcount[step] elements from rbuf[rindex[step]...]</span></span><br><span class="line"><span class="comment">               Recv scount[step] elements to rbuf[sindex[step]...] 
*/</span></span><br><span class="line">            <span class="built_in">ompi_coll_base_sendrecv</span>((<span class="type">char</span> *)rbuf + (<span class="type">ptrdiff_t</span>)rindex[step] * extent,</span><br><span class="line">                                rcount[step], dtype, dest, ...,</span><br><span class="line">                                (<span class="type">char</span> *)rbuf + (<span class="type">ptrdiff_t</span>)sindex[step] * extent,</span><br><span class="line">                                scount[step], dtype, dest, ..., comm, ..., rank);</span><br><span class="line">            step--;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    <span class="comment">/* Step 4. Send total result to excluded odd ranks. */</span></span><br><span class="line">    <span class="keyword">if</span> (rank &lt; <span class="number">2</span> * nprocs_rem)</span><br><span class="line">    &#123;</span><br><span class="line">        <span class="keyword">if</span>(rank % <span class="number">2</span> != <span class="number">0</span>) <span class="comment">/* Odd process -- recv result from rank - 1 */</span></span><br><span class="line">            <span class="built_in">MCA_PML_CALL</span>(<span class="built_in">recv</span>(rbuf, count, dtype, rank - <span class="number">1</span>, ..., comm, ...));</span><br><span class="line">        <span class="keyword">else</span> <span class="comment">/* Even process -- send result to rank + 1 */</span></span><br><span class="line">            <span class="built_in">MCA_PML_CALL</span>(<span class="built_in">send</span>(rbuf, count, dtype, rank + <span class="number">1</span>, ..., comm));</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="comment">//free buffer rindex, sindex, rcount, scount;</span></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
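<p>To make the window bookkeeping above concrete, here is a single-process C++ sketch of the same scheme, restricted to a power-of-two number of ranks (so the Step 1 pre-processing is skipped). This is an illustration, not the OpenMPI code: <code>simulate_redscat_allgather</code> and its variables are hypothetical names, every "message" is a direct read of the peer's buffer from the previous step, and the final allgather is shortened to a direct assembly of the finished slices.</p>

```cpp
#include <cassert>
#include <vector>

// Hypothetical single-process simulation of the reduce-scatter phase
// (recursive vector halving + recursive distance doubling) for a
// power-of-two number of ranks. ri/wsize play the role of rindex/wsize
// in the OpenMPI code; peer messages are direct buffer reads.
std::vector<std::vector<double>> simulate_redscat_allgather(
        std::vector<std::vector<double>> buf) {
    const int p = static_cast<int>(buf.size());
    const int count = static_cast<int>(buf[0].size());
    int nsteps = 0;
    while ((1 << (nsteps + 1)) <= p) ++nsteps;      // ilog2(p)
    assert((1 << nsteps) == p && count >= p);       // pof2 ranks only

    std::vector<int> ri(p, 0), wsize(p, count);     // per-rank window
    for (int step = 0; step < nsteps; ++step) {
        const int mask = 1 << step;
        std::vector<std::vector<double>> next = buf;  // snapshot = "recv"
        std::vector<int> nri(p), nws(p);
        for (int r = 0; r < p; ++r) {
            const int dest = r ^ mask;              // exchange partner
            // The lower rank keeps and reduces the left half of the
            // current window, the higher rank the right half.
            const int lhalf = wsize[r] / 2;
            const int begin = (r < dest) ? ri[r] : ri[r] + lhalf;
            const int len = (r < dest) ? lhalf : wsize[r] - lhalf;
            for (int i = 0; i < len; ++i)           // rbuf[] = tmp_buf[] <op> rbuf[]
                next[r][begin + i] = buf[r][begin + i] + buf[dest][begin + i];
            nri[r] = begin;
            nws[r] = len;
        }
        buf = next;
        ri = nri;
        wsize = nws;
    }
    // Allgather, shortened for the sketch: each rank's final window already
    // holds the fully reduced slice, so assemble the result directly rather
    // than replaying the exchanges in reverse order.
    std::vector<double> full(count);
    for (int r = 0; r < p; ++r)
        for (int i = 0; i < wsize[r]; ++i)
            full[ri[r] + i] = buf[r][ri[r] + i];
    for (int r = 0; r < p; ++r) buf[r] = full;
    return buf;
}
```

<p>With p = 4 and count = 8, the windows shrink 8 → 4 → 2 over two steps, and the four final windows partition the vector, exactly as the rindex/rcount arrays record in the real implementation.</p>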

<h2 id="5、ring算法"><a href="#5、ring算法" class="headerlink" title="5. The ring algorithm"></a>5. The ring algorithm</h2><h3 id="OpenMPI版-3"><a href="#OpenMPI版-3" class="headerlink" title="OpenMPI version"></a>OpenMPI version</h3><p>Note: the ring Allreduce exists only in OpenMPI.</p>
<figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span 
class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment"> * The message is automatically split into N segments of M/N elements each, and the algorithm runs in 2*N - 1 steps. It does not guarantee the order of the reduction, so only commutative operations are supported, and it does not work when the element count is smaller than the number of processes.</span></span><br><span class="line"><span class="comment"> * Example on 5 nodes:</span></span><br><span class="line"><span class="comment"> *         Initial state</span></span><br><span class="line"><span class="comment"> *   #      0              1             2              3             4</span></span><br><span class="line"><span class="comment"> *        [00]           [10]          [20]           [30]           [40]</span></span><br><span class="line"><span class="comment"> *        [01]           [11]          [21]           [31]           [41]</span></span><br><span class="line"><span class="comment"> *        [02]           [12]          [22]           [32]           [42]</span></span><br><span class="line"><span class="comment"> *        [03]           [13]          [23]           [33]           [43]</span></span><br><span class="line"><span class="comment"> *        [04]           [14]          [24]           [34]           [44]</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> *  Computation phase</span></span><br><span class="line"><span class="comment"> *  Step 0: rank r sends block r to rank (r+1) and receives block (r-1)</span></span><br><span class="line"><span class="comment"> *  from rank (r-1) [with wraparound].</span></span><br><span class="line"><span class="comment"> *    #     0              1             2              3             4</span></span><br><span class="line"><span class="comment"> *        [00]          [00+10]        [20]           [30]           [40]</span></span><br><span class="line"><span class="comment"> *        [01]           [11]         
[11+21]         [31]           [41]</span></span><br><span class="line"><span class="comment"> *        [02]           [12]          [22]          [22+32]         [42]</span></span><br><span class="line"><span class="comment"> *        [03]           [13]          [23]           [33]         [33+43]</span></span><br><span class="line"><span class="comment"> *      [44+04]          [14]          [24]           [34]           [44]</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> *  Step 1: rank r sends block (r-1) to rank (r+1) and receives block</span></span><br><span class="line"><span class="comment"> *  (r-2) from rank (r-1) [with wraparound].</span></span><br><span class="line"><span class="comment"> *    #      0              1             2              3             4</span></span><br><span class="line"><span class="comment"> *         [00]          [00+10]     [00+10+20]        [30]           [40]</span></span><br><span class="line"><span class="comment"> *         [01]           [11]         [11+21]      [11+21+31]        [41]</span></span><br><span class="line"><span class="comment"> *         [02]           [12]          [22]          [22+32]      [22+32+42]</span></span><br><span class="line"><span class="comment"> *      [33+43+03]        [13]          [23]           [33]         [33+43]</span></span><br><span class="line"><span class="comment"> *        [44+04]       [44+04+14]       [24]           [34]           [44]</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> *  Step 2: rank r sends block (r-2) to rank (r+1) and receives block</span></span><br><span class="line"><span class="comment"> *  (r-3) from rank (r-1) [with wraparound].</span></span><br><span class="line"><span class="comment"> *    #      0              1             2              3             4</span></span><br><span class="line"><span
class="comment"> *         [00]          [00+10]     [00+10+20]    [00+10+20+30]      [40]</span></span><br><span class="line"><span class="comment"> *         [01]           [11]         [11+21]      [11+21+31]    [11+21+31+41]</span></span><br><span class="line"><span class="comment"> *     [22+32+42+02]      [12]          [22]          [22+32]      [22+32+42]</span></span><br><span class="line"><span class="comment"> *      [33+43+03]    [33+43+03+13]     [23]           [33]         [33+43]</span></span><br><span class="line"><span class="comment"> *        [44+04]       [44+04+14]  [44+04+14+24]      [34]           [44]</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> *  Step 3: rank r sends block (r-3) to rank (r+1) and receives block</span></span><br><span class="line"><span class="comment"> *   (r-4) from rank (r-1) [with wraparound].</span></span><br><span class="line"><span class="comment"> *    #      0              1             2              3             4</span></span><br><span class="line"><span class="comment"> *         [00]          [00+10]     [00+10+20]    [00+10+20+30]     [FULL]</span></span><br><span class="line"><span class="comment"> *        [FULL]           [11]        [11+21]      [11+21+31]    [11+21+31+41]</span></span><br><span class="line"><span class="comment"> *     [22+32+42+02]     [FULL]          [22]         [22+32]      [22+32+42]</span></span><br><span class="line"><span class="comment"> *      [33+43+03]    [33+43+03+13]     [FULL]          [33]         [33+43]</span></span><br><span class="line"><span class="comment"> *        [44+04]       [44+04+14]  [44+04+14+24]      [FULL]         [44]</span></span><br><span class="line"><span class="comment"> *</span></span><br><span class="line"><span class="comment"> *  Distribution phase: ring ALLGATHER with ranks shifted by 1.</span></span><br><span class="line"><span class="comment"> */</span></span><br><span
class="line"><span class="function"><span class="type">int</span> <span class="title">ompi_coll_base_allreduce_intra_ring</span><span class="params">(...)</span></span></span><br><span class="line"><span class="function"></span>&#123;</span><br><span class="line">    <span class="type">int</span> size = <span class="built_in">ompi_comm_size</span>(comm);</span><br><span class="line">    <span class="type">int</span> rank = <span class="built_in">ompi_comm_rank</span>(comm);</span><br><span class="line"></span><br><span class="line">    <span class="keyword">if</span> (<span class="number">1</span> == size)  <span class="comment">// single process: just copy sbuf to rbuf, then</span></span><br><span class="line">        <span class="keyword">return</span> MPI_SUCCESS;</span><br><span class="line">    <span class="comment">// if count is smaller than the number of processes, fall back to recursive doubling</span></span><br><span class="line">    <span class="keyword">if</span> (count &lt; size) </span><br><span class="line">        <span class="keyword">return</span> <span class="built_in">ompi_coll_base_allreduce_intra_recursivedoubling</span>(sbuf, rbuf, count, dtype, op, comm, <span class="keyword">module</span>);</span><br><span class="line"></span><br><span class="line">    </span><br><span class="line">    </span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
      
    </div>
    <footer class="article-footer">
      <a data-url="https://chudod.gitee.io/myworks/2022/04/13/OpenMPI&MPICH_Allreduce/" data-id="cl3kznr07000vj0ukcwuf5emg" data-title="MPI_Allreduce的在OpenMPI、MPICH中的实现" class="article-share-link">Share</a>
      
      
      
  <ul class="article-tag-list" itemprop="keywords"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/MPI/" rel="tag">MPI</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/" rel="tag">Parallel Computing</a></li></ul>

    </footer>
  </div>
  
</article>



  
    <article id="post-Guiding_Board" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
  <div class="article-meta">
    <a href="/myworks/2022/04/10/Guiding_Board/" class="article-date">
  <time class="dt-published" datetime="2022-04-09T16:00:00.000Z" itemprop="datePublished">2022-04-10</time>
</a>
    
  <div class="article-category">
    <a class="article-category-link" href="/myworks/categories/%E5%AF%BC%E8%88%AA/">Navigation</a>
  </div>

  </div>
  <div class="article-inner">
    
    
      <header class="article-header">
        
  
    <h1 itemprop="name">
      <a class="p-name article-title" href="/myworks/2022/04/10/Guiding_Board/">Read Me (Navigation)</a>
    </h1>
  

      </header>
    
    <div class="e-content article-entry" itemprop="articleBody">
      
        <h1 id="搭建该网站的目的"><a href="#搭建该网站的目的" class="headerlink" title="搭建该网站的目的"></a>Why I Built This Site</h1><p>I built this site for two reasons. First, I am writing a survey paper on the intersection of parallel computing and interconnection networks. The survey covers a very broad area and reaches down to the lowest layers of MPI and the interconnect, so it is quite demanding; to accumulate material and take notes, I decided to build a site of my own. Second, I hope the work and ideas I organize here will not only be seen by the teachers and senior students in my group and spark discussion, but will also reach a wider audience. Over time the site will keep growing and becoming richer. Besides summaries of papers I have read and some of my own ideas and work, it also records experiments I have done, computer science fundamentals, and even some odd knowledge, fun side projects, and snapshots of my life past and present. The world is a complicated place, and it is worth striving for.  </p>
<h1 id="网站内容导航"><a href="#网站内容导航" class="headerlink" title="网站内容导航"></a>Site Content Guide</h1><h2 id="一、并行计算系列"><a href="#一、并行计算系列" class="headerlink" title="一、并行计算系列"></a>I. Parallel Computing Series</h2><h3 id="1、MPI程序设计基础与并行算法"><a href="#1、MPI程序设计基础与并行算法" class="headerlink" title="1、MPI程序设计基础与并行算法"></a>1. MPI Programming Basics and Parallel Algorithms</h3><p><a target="_blank" rel="noopener" href="https://synchrosky.com/2022/03/30/First_MPI_Program/">Installing MPICH and a First Allreduce Program</a><br><a target="_blank" rel="noopener" href="https://synchrosky.com/2022/03/26/%E5%88%A9%E7%94%A8MPI%E5%AE%9E%E7%8E%B0Cannon%E7%AE%97%E6%B3%95%E5%B9%B6%E8%A1%8C%E7%9F%A9%E9%98%B5%E4%B9%98%E6%B3%95/">Parallel Matrix Multiplication: Cannon's Algorithm</a>  </p>
<h3 id="2、MPI集合通信算法"><a href="#2、MPI集合通信算法" class="headerlink" title="2、MPI集合通信算法"></a>2. MPI Collective Communication Algorithms</h3><p><a target="_blank" rel="noopener" href="https://synchrosky.com/2022/03/27/MPI_Allreduce_Summary/">The Past and Present of MPI_Allreduce</a><br><a target="_blank" rel="noopener" href="https://synchrosky.com/2022/04/10/MPI_Collectives_Thesis/">MPI Collective Communication Paper Roundup</a><br><a target="_blank" rel="noopener" href="https://synchrosky.com/2022/04/13/OpenMPI&MPICH_Allreduce/">Allreduce Implementations in OpenMPI and MPICH</a><br><a target="_blank" rel="noopener" href="https://synchrosky.com/2022/03/30/Deeply_Understand_MPI/">Understanding MPI Implementations in Depth</a><br><a target="_blank" rel="noopener" href="https://synchrosky.com/2022/04/25/Topology_Aware_Collectives/">HPC Networks and Topology-Aware Collective Communication Algorithms</a></p>
<h3 id="3、在网计算-In-Network-集合通信"><a href="#3、在网计算-In-Network-集合通信" class="headerlink" title="3、在网计算(In-Network)集合通信"></a>3. In-Network Collective Communication</h3><p><a target="_blank" rel="noopener" href="https://synchrosky.com/2022/04/10/In-Network_Collective_Communication/">In-Network Collective Communication</a><br><a target="_blank" rel="noopener" href="https://synchrosky.com/2022/04/10/%E9%9B%86%E5%90%88%E9%80%9A%E4%BF%A1%E4%B9%8B%E5%9C%A8%E7%BD%91%E8%AE%A1%E7%AE%97%E8%AE%BA%E6%96%87%E6%95%B4%E7%90%86/">In-Network Collective Communication Paper Roundup</a></p>
<h2 id="二、计算机基础知识系列"><a href="#二、计算机基础知识系列" class="headerlink" title="二、计算机基础知识系列"></a>II. Computer Science Fundamentals Series</h2><h3 id="1、操作系统"><a href="#1、操作系统" class="headerlink" title="1、操作系统"></a>1. Operating Systems</h3><p><a target="_blank" rel="noopener" href="https://synchrosky.com/2022/04/10/OS_Starting/">OS Boot Basics</a><br><a target="_blank" rel="noopener" href="https://synchrosky.com/2022/04/14/OS_Memory_Management/">Operating System Memory Management</a></p>
<h2 id="三、硬核学者系列"><a href="#三、硬核学者系列" class="headerlink" title="三、硬核学者系列"></a>III. Hardcore Scholars Series</h2><p><a target="_blank" rel="noopener" href="https://synchrosky.com/2022/04/20/TorstenHeofler/">Hardcore Scholars: Torsten Hoefler</a></p>
<h2 id="四、奇怪知识系列"><a href="#四、奇怪知识系列" class="headerlink" title="四、奇怪知识系列"></a>IV. Odd Knowledge Series</h2><h2 id="五、整活系列"><a href="#五、整活系列" class="headerlink" title="五、整活系列"></a>V. Fun Projects Series</h2><p>1. <a target="_blank" rel="noopener" href="https://synchrosky.com/2022/03/26/xjtu/">The Old Days at XJTU</a>    </p>
<p>Site content is continuously being updated…<br><img src="https://note.youdao.com/yws/api/personal/file/WEB8a52193ab1800fc09b72a30eedd0319b?method=download&shareKey=db01fa57b10b434ada861e14c93e6ff2"></p>

      
    </div>
    <footer class="article-footer">
      <a data-url="https://chudod.gitee.io/myworks/2022/04/10/Guiding_Board/" data-id="cl3kznqzx0009j0uk7kephoqp" data-title="Read Me（导航栏）" class="article-share-link">Share</a>
      
      
      
    </footer>
  </div>
  
</article>



  
    <article id="post-In-Network_Collective_Summary" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
  <div class="article-meta">
    <a href="/myworks/2022/04/10/In-Network_Collective_Summary/" class="article-date">
  <time class="dt-published" datetime="2022-04-09T16:00:00.000Z" itemprop="datePublished">2022-04-10</time>
</a>
    
  <div class="article-category">
    <a class="article-category-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/">Parallel Computing</a>►<a class="article-category-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/In-Network-Collectives/">In-Network Collectives</a>
  </div>

  </div>
  <div class="article-inner">
    
    
      <header class="article-header">
        
  
    <h1 itemprop="name">
      <a class="p-name article-title" href="/myworks/2022/04/10/In-Network_Collective_Summary/">In-Network Collective Communication</a>
    </h1>
  

      </header>
    
    <div class="e-content article-entry" itemprop="articleBody">
      
        
      
    </div>
    <footer class="article-footer">
      <a data-url="https://chudod.gitee.io/myworks/2022/04/10/In-Network_Collective_Summary/" data-id="cl3kznqzy000aj0uk9bmr6tgn" data-title="In-Network Collective Communication" class="article-share-link">Share</a>
      
      
      
  <ul class="article-tag-list" itemprop="keywords"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/MPI/" rel="tag">MPI</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/%E5%9C%A8%E7%BD%91%E8%AE%A1%E7%AE%97/" rel="tag">In-Network Computing</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/" rel="tag">Parallel Computing</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/%E9%9B%86%E5%90%88%E9%80%9A%E4%BF%A1/" rel="tag">Collective Communication</a></li></ul>

    </footer>
  </div>
  
</article>



  
    <article id="post-In_Network_Collectives_Thesis" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
  <div class="article-meta">
    <a href="/myworks/2022/04/10/In_Network_Collectives_Thesis/" class="article-date">
  <time class="dt-published" datetime="2022-04-09T16:00:00.000Z" itemprop="datePublished">2022-04-10</time>
</a>
    
  <div class="article-category">
    <a class="article-category-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/">Parallel Computing</a>►<a class="article-category-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/In-Network-Collectives/">In-Network Collectives</a>
  </div>

  </div>
  <div class="article-inner">
    
    
      <header class="article-header">
        
  
    <h1 itemprop="name">
      <a class="p-name article-title" href="/myworks/2022/04/10/In_Network_Collectives_Thesis/">In-Network Collective Communication Paper Roundup</a>
    </h1>
  

      </header>
    
    <div class="e-content article-entry" itemprop="articleBody">
      
        
      
    </div>
    <footer class="article-footer">
      <a data-url="https://chudod.gitee.io/myworks/2022/04/10/In_Network_Collectives_Thesis/" data-id="cl3kznr00000dj0uk8n2c3dt1" data-title="In-Network Collective Communication论文整理" class="article-share-link">Share</a>
      
      
      
  <ul class="article-tag-list" itemprop="keywords"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/MPI/" rel="tag">MPI</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/%E5%9C%A8%E7%BD%91%E8%AE%A1%E7%AE%97/" rel="tag">In-Network Computing</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/" rel="tag">Parallel Computing</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/%E9%9B%86%E5%90%88%E9%80%9A%E4%BF%A1/" rel="tag">Collective Communication</a></li></ul>

    </footer>
  </div>
  
</article>



  
    <article id="post-MPI_Collectives_Thesis" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
  <div class="article-meta">
    <a href="/myworks/2022/04/10/MPI_Collectives_Thesis/" class="article-date">
  <time class="dt-published" datetime="2022-04-09T16:00:00.000Z" itemprop="datePublished">2022-04-10</time>
</a>
    
  <div class="article-category">
    <a class="article-category-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/">Parallel Computing</a>►<a class="article-category-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/MPI%E9%9B%86%E5%90%88%E9%80%9A%E4%BF%A1%E7%AE%97%E6%B3%95/">MPI Collective Communication Algorithms</a>
  </div>

  </div>
  <div class="article-inner">
    
    
      <header class="article-header">
        
  
    <h1 itemprop="name">
      <a class="p-name article-title" href="/myworks/2022/04/10/MPI_Collectives_Thesis/">MPI Collective Communication Paper Roundup</a>
    </h1>
  

      </header>
    
    <div class="e-content article-entry" itemprop="articleBody">
      
        <h1 id="摘要"><a href="#摘要" class="headerlink" title="摘要"></a>Abstract</h1><p>This post gives a comprehensive list of papers on MPI collective communication, with a brief overview of each rather than a detailed treatment. For detailed MPI collective algorithm design, see <a target="_blank" rel="noopener" href="https://synchrosky.com/2022/03/27/MPI_Allreduce%E7%9A%84%E5%89%8D%E4%B8%96%E4%BB%8A%E7%94%9F/">The Past and Present of MPI_Allreduce</a>.  </p>
<h1 id="MPI集合通信论文"><a href="#MPI集合通信论文" class="headerlink" title="MPI集合通信论文"></a>MPI Collective Communication Papers</h1><h2 id="1、2005-Optimization-of-Collective-Communication-Operations-in-MPICH"><a href="#1、2005-Optimization-of-Collective-Communication-Operations-in-MPICH" class="headerlink" title="1、2005: Optimization of Collective Communication Operations in MPICH"></a><a target="_blank" rel="noopener" href="https://journals.sagepub.com/doi/10.1177/1094342005051521">1. 2005: Optimization of Collective Communication Operations in MPICH</a></h2><p>The classic survey of MPI collective communication. It presents the common collective operations and their algorithms (Allgather, Broadcast, All-to-all, Reduce-Scatter, Reduce, and Allreduce), and further discusses optimizations of MPI_Reduce and MPI_Allreduce. The cost model used is $T=\alpha+n\beta$.  </p>
<h2 id="2、-SC17-Why-is-MPI-so-slow-Analyzing-the-fundamental-limits-in-implementing-MPI-3-1"><a href="#2、-SC17-Why-is-MPI-so-slow-Analyzing-the-fundamental-limits-in-implementing-MPI-3-1" class="headerlink" title="2、(SC17) Why is MPI so slow?: Analyzing the fundamental limits in implementing MPI-3.1"></a><a target="_blank" rel="noopener" href="https://dl.acm.org/doi/10.1145/3126908.3126963">2. (SC17) Why is MPI so slow?: Analyzing the fundamental limits in implementing MPI-3.1</a></h2><p>This paper analyzes the main factors that constrain the MPI-3.1 standard and, to address its shortcomings, redesigns the MPICH CH4 software stack; it is very helpful for understanding MPI implementations in depth.</p>

      
    </div>
    <footer class="article-footer">
      <a data-url="https://chudod.gitee.io/myworks/2022/04/10/MPI_Collectives_Thesis/" data-id="cl3kznr02000hj0uk50a83v4k" data-title="MPI集合通信论文整理" class="article-share-link">Share</a>
      
      
      
  <ul class="article-tag-list" itemprop="keywords"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/MPI/" rel="tag">MPI</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/" rel="tag">Parallel Computing</a></li></ul>

    </footer>
  </div>
  
</article>



  
    <article id="post-OS_Starting" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
  <div class="article-meta">
    <a href="/myworks/2022/04/10/OS_Starting/" class="article-date">
  <time class="dt-published" datetime="2022-04-09T16:00:00.000Z" itemprop="datePublished">2022-04-10</time>
</a>
    
  <div class="article-category">
    <a class="article-category-link" href="/myworks/categories/%E8%AE%A1%E7%AE%97%E6%9C%BA%E5%9F%BA%E7%A1%80%E7%9F%A5%E8%AF%86/">Computer Science Fundamentals</a>►<a class="article-category-link" href="/myworks/categories/%E8%AE%A1%E7%AE%97%E6%9C%BA%E5%9F%BA%E7%A1%80%E7%9F%A5%E8%AF%86/%E6%93%8D%E4%BD%9C%E7%B3%BB%E7%BB%9F/">Operating Systems</a>
  </div>

  </div>
  <div class="article-inner">
    
    
      <header class="article-header">
        
  
    <h1 itemprop="name">
      <a class="p-name article-title" href="/myworks/2022/04/10/OS_Starting/">A Brief Overview of the OS Boot Process</a>
    </h1>
  

      </header>
    
    <div class="e-content article-entry" itemprop="articleBody">
      
        <h1 id="一、操作系统启动生成的进程树"><a href="#一、操作系统启动生成的进程树" class="headerlink" title="一、操作系统启动生成的进程树"></a>I. The Process Tree Created at OS Boot</h1><p>The figure below shows the process tree generated by the operating system, which I captured while debugging during a lab. When the computer first boots it executes assembly code; the first line of C code it then runs is</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">start_kernel</span><br></pre></td></tr></table></figure>
<p>Once this function is running, the executing program becomes process 0, which then calls</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">__do_fork</span><br><span class="line">do_execve</span><br></pre></td></tr></table></figure>
<p>to perform the $\verb+fork+$ and $\verb+exec+$ system calls, creating processes 1 and 2. Process 1 is the famous $init$ process, the root of all user processes (including shells, user login processes, and so on). Process 2 creates and maintains all kernel threads (in this figure, process 960 and earlier). Finally, process 0 calls</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">cpu_idle</span><br></pre></td></tr></table></figure>
<p>and goes to sleep; the CPU runs it whenever there is no active task to schedule.<br><img src="https://note.youdao.com/yws/api/personal/file/WEBaa78b81deaec02c397484e8a2a5b43d4?method=download&shareKey=69113f8f61e335d1864757cf6d288929" alt="Process tree generated at OS boot"></p>
<h1 id="二、计算机启动过程"><a href="#二、计算机启动过程" class="headerlink" title="二、计算机启动过程"></a>II. The Computer Boot Process</h1><p>A diagram I drew for an undergraduate lab, depicting MBR-based boot: the computer first runs the BIOS, jumps to address $\verb+0x7c00+$ to execute the MBR, and then runs the Loader, which switches from real mode to protected mode, initializes the page tables, and so on, before entering the kernel.<br><img src="https://note.youdao.com/yws/api/personal/file/WEB06869dacb48da066cbcdabfb88241216?method=download&shareKey=69113f8f61e335d1864757cf6d288929" alt="MBR boot process"></p>

      
    </div>
    <footer class="article-footer">
      <a data-url="https://chudod.gitee.io/myworks/2022/04/10/OS_Starting/" data-id="cl3kznr0b0015j0uk0dzc81gy" data-title="操作系统启动过程简述" class="article-share-link">Share</a>
      
      
      
  <ul class="article-tag-list" itemprop="keywords"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/%E6%93%8D%E4%BD%9C%E7%B3%BB%E7%BB%9F/" rel="tag">Operating Systems</a></li></ul>

    </footer>
  </div>
  
</article>



  
    <article id="post-Deeply_Understand_MPI" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
  <div class="article-meta">
    <a href="/myworks/2022/03/30/Deeply_Understand_MPI/" class="article-date">
  <time class="dt-published" datetime="2022-03-29T16:00:00.000Z" itemprop="datePublished">2022-03-30</time>
</a>
    
  <div class="article-category">
    <a class="article-category-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/">Parallel Computing</a>►<a class="article-category-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/MPI%E8%BD%AF%E4%BB%B6%E6%A0%88%E8%AE%BE%E8%AE%A1/">MPI Software Stack Design</a>
  </div>

  </div>
  <div class="article-inner">
    
    
      <header class="article-header">
        
  
    <h1 itemprop="name">
      <a class="p-name article-title" href="/myworks/2022/03/30/Deeply_Understand_MPI/">Understanding MPI Implementations in Depth</a>
    </h1>
  

      </header>
    
    <div class="e-content article-entry" itemprop="articleBody">
      
        <p>This post discusses the layers of the MPI communication stack and the main factors limiting MPI performance.<br><a target="_blank" rel="noopener" href="https://dl.acm.org/doi/10.1145/3126908.3126963">(SC17) Why is MPI so slow?: Analyzing the fundamental limits in implementing MPI-3.1</a><br>This paper analyzes the main constraints imposed by the MPI-3.1 standard and is very helpful for understanding MPI implementations in depth. Most other MPI optimization papers focus on a particular implementation (for example, optimizing inside MPICH or OpenMPI); this one instead targets the MPI-3.1 standard itself and presents a highly optimized MPI implementation, built in the MPICH environment, whose core contribution is a redesign of the CH4 layer of the MPICH communication stack, as shown below.</p>
<div style="text-align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEB53b70dc12111821a8ae1e0f1f1319988?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13" width="70%" height="70%"/>
</div>

<p>MPICH is a typical multi-layer software stack. Below the MPI user interface, the top layer is the "MPI Layer", which runs machine-independent code; beneath it sits the machine-dependent code, also called the abstract device layer (or device layer). Older MPICH provided two devices: CH3 (for most platforms) and PAMID (for IBM Blue Gene). CH4 is the device layer redesigned in this paper, consisting of the CH4 Core plus Shmmods (shared-memory communication) and Netmods (network communication). The control flow of an MPI_Put operation is as follows: (1) the MPI Layer checks the arguments for errors, inspects the MPI window object describing the target process and memory, and decides whether communication should take the thread-safe or the non-thread-safe path; (2) the CH4 Core checks locality: if the target process is the caller itself, the CH4 Core handles the operation directly; if the communication stays within one node it goes through Shmmods, otherwise through Netmods, and control enters that module; (3) inside Netmods or Shmmods the operation is analyzed to decide whether it can be executed directly in hardware or must fall back to active messages (I have not fully figured out what this part does). For MPI_Put in Netmods, the operations the hardware can complete directly are limited to RDMA on contiguous memory; complex datatypes fall back to the active-message fallback module in the CH4 Core. When completing an MPI_Put in hardware, the MPI-level arguments are translated into network-level ones (for example, the MPI target displacement is converted to an OS virtual address), and one or more network operations are issued according to the MPI operation.</p>

      
    </div>
    <footer class="article-footer">
      <a data-url="https://chudod.gitee.io/myworks/2022/03/30/Deeply_Understand_MPI/" data-id="cl3kznqzq0001j0uk51u72dvb" data-title="深入理解MPI实现" class="article-share-link">Share</a>
      
      
      
  <ul class="article-tag-list" itemprop="keywords"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/MPI/" rel="tag">MPI</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/" rel="tag">Parallel Computing</a></li></ul>

    </footer>
  </div>
  
</article>



  
    <article id="post-First_MPI_Program" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
  <div class="article-meta">
    <a href="/myworks/2022/03/30/First_MPI_Program/" class="article-date">
  <time class="dt-published" datetime="2022-03-29T16:00:00.000Z" itemprop="datePublished">2022-03-30</time>
</a>
    
  <div class="article-category">
    <a class="article-category-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/">Parallel Computing</a>►<a class="article-category-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/MPI%E7%A8%8B%E5%BA%8F%E8%AE%BE%E8%AE%A1/">MPI Programming</a>
  </div>

  </div>
  <div class="article-inner">
    
    
      <header class="article-header">
        
  
    <h1 itemprop="name">
      <a class="p-name article-title" href="/myworks/2022/03/30/First_MPI_Program/">First Allreduce Application</a>
    </h1>
  

      </header>
    
    <div class="e-content article-entry" itemprop="articleBody">
      
        <h1 id="MPI安装"><a href="#MPI安装" class="headerlink" title="MPI安装"></a>Installing MPI</h1><h2 id="Linux安装MPICH"><a href="#Linux安装MPICH" class="headerlink" title="Linux安装MPICH"></a>Installing MPICH on Linux</h2><p>$MPICH$ is a very widely used $MPI$ library. A simple $sudo\ apt-get\ install$ on $linux$ is enough, though you can also build $MPICH$ from source; see <a target="_blank" rel="noopener" href="https://mpitutorial.com/tutorials/installing-mpich2/zh_cn/">Installing MPICH</a></p>
<figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">sudo apt-get update</span><br><span class="line">sudo apt-get install mpich</span><br></pre></td></tr></table></figure>
<p>By adding the following to your program</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="keyword">include</span><span class="string">&lt;mpi.h&gt;</span></span></span><br></pre></td></tr></table></figure>
<p>you can call $MPICH$; see <a target="_blank" rel="noopener" href="https://mpitutorial.com/tutorials/mpi-hello-world/zh_cn/">here</a> for writing a first simple $MPI\ Hello\ World$ program.<br>$mpicc$ and $mpicxx$ are the $MPICH$ compiler wrappers; as the names suggest, they are for $C$ and $C++$ programs respectively. Simply using</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">mpicc helloworld.c -o helloworld</span><br></pre></td></tr></table></figure>
<p>compiles the program, and then</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">mpirun -n <span class="number">4</span> helloworld</span><br></pre></td></tr></table></figure>
<p>runs the $mpi$ program. $mpiexec$ and $mpirun$ are the two launchers $MPICH$ provides for running our applications; $mpirun$ makes every process run the same program, while $mpiexec$ can make different processes run different programs, e.g.</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">mpiexec -n <span class="number">1</span> program1 : -n <span class="number">2</span> program2</span><br></pre></td></tr></table></figure>
<h1 id="Windows安装MPI"><a href="#Windows安装MPI" class="headerlink" title="Windows安装MPI"></a>Installing MPI on Windows</h1><h2 id="下载MPICH"><a href="#下载MPICH" class="headerlink" title="下载MPICH"></a>Downloading MPICH</h2><p>Download the package from the <a target="_blank" rel="noopener" href="https://www.mpich.org/downloads/">MPICH website</a>: choose the $Windows$ system, then click $http$ here.<img src="https://note.youdao.com/yws/api/personal/file/WEB20896ad49caab7166ef314cad41a02b6?method=download&shareKey=904f976c5859bc89c59dc0453602ff98"><br>Then choose this installer:<br><img src="https://note.youdao.com/yws/api/personal/file/WEB7c6b5b95675ec4f841346e261f69ce27?method=download&shareKey=707ee3c6204b8e72dfaa8e511c794021"><br>Install both packages:<br><img src="https://note.youdao.com/yws/api/personal/file/WEB6d1c0e875e7a9f657caabefebba78720?method=download&shareKey=707ee3c6204b8e72dfaa8e511c794021"><br>Run the two installers; they produce the following two folders (you can put them on the D drive and name the folders yourself):<br><img src="https://note.youdao.com/yws/api/personal/file/WEBf74acea7bf53fffaa844e63a7dec88c0?method=download&shareKey=707ee3c6204b8e72dfaa8e511c794021"></p>
<h2 id="在Visual-Studio中配置环境"><a href="#在Visual-Studio中配置环境" class="headerlink" title="在Visual Studio中配置环境"></a>Configuring the Environment in Visual Studio</h2><p>In $Visual\ Studio$, choose "Create a new project" -&gt; "Console App". My project is named "MPITest"; then open "Debug" -&gt; "Debug Properties".<br><img src="https://note.youdao.com/yws/api/personal/file/WEBe948b4c6c4177ceb858b5b701b2c30d5?method=download&shareKey=707ee3c6204b8e72dfaa8e511c794021"><br>Select "C&#x2F;C++" -&gt; "Preprocessor" -&gt; "Preprocessor Definitions" -&gt; add "MPICH_SKIP_MPICXX"<br><img src="https://note.youdao.com/yws/api/personal/file/WEB20c39cee8383c7ff5aa58051b1813587?method=download&shareKey=707ee3c6204b8e72dfaa8e511c794021"><br>Select "Code Generation" -&gt; "Runtime Library" -&gt; "Multi-threaded Debug (/MTd)":<br><img src="https://note.youdao.com/yws/api/personal/file/WEB3649df78f08d5e177074bb94203e796c?method=download&shareKey=707ee3c6204b8e72dfaa8e511c794021"><br>Select "Input" -&gt; "Additional Dependencies" -&gt; add "msmpi.lib"<br><img src="https://note.youdao.com/yws/api/personal/file/WEB40a3f1b4bf58bf1355913ab011d39dba?method=download&shareKey=707ee3c6204b8e72dfaa8e511c794021"><br>Select "VC++ Directories" -&gt; "Include Directories" -&gt; add "(MPI install location)\Microsoft SDKs\Include";<br><img src="https://note.youdao.com/yws/api/personal/file/WEB45e6a730d39c84744f498943cede91f3?method=download&shareKey=707ee3c6204b8e72dfaa8e511c794021"><br>then select "Library Directories" -&gt; add "(MPI install location)\Microsoft SDKs\Lib\x64"<br><img src="https://note.youdao.com/yws/api/personal/file/WEBb24da11e6a4661787def0934cdfbe1fd?method=download&shareKey=707ee3c6204b8e72dfaa8e511c794021"><br>Configuration complete.</p>
<h2 id="运行方法"><a href="#运行方法" class="headerlink" title="运行方法"></a>How to Run</h2><h3 id="使用cmd终端或者Powershell终端运行"><a href="#使用cmd终端或者Powershell终端运行" class="headerlink" title="使用cmd终端或者Powershell终端运行"></a>Running from a cmd or PowerShell Terminal</h3><p>Write the code following the $Allreduce$ integration program in the next section, then click Run in $VS$ to produce the $.exe$ file. Since $VS$ defaults to a single process, this only runs the one-process case.<br><img src="https://note.youdao.com/yws/api/personal/file/WEB45f2da6f8a6933d86356422c763a9d35?method=download&shareKey=707ee3c6204b8e72dfaa8e511c794021"><br>Open a cmd terminal and, in the directory containing the $.exe$ file, enter</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">mpiexec -n <span class="number">4</span> MPITest.exe</span><br></pre></td></tr></table></figure>
<p>to get the result:<br><img src="https://note.youdao.com/yws/api/personal/file/WEB2a2c779942cc9145ed1cb903c51aa7e3?method=download&shareKey=707ee3c6204b8e72dfaa8e511c794021"></p>
<h3 id="在VS里配置命令"><a href="#在VS里配置命令" class="headerlink" title="在VS里配置命令"></a>Configuring the Command in VS</h3><p>Copy the $MPITest.exe$ program to the $C$ drive, then open "MPITest project properties" -&gt; "Debugging" and configure the command-line properties as shown. Note that the command points to the folder containing "mpiexec.exe", inside the "Microsoft MPI" folder produced by the two installers, and $MPITest.exe$ really must be copied to the $C$ drive, otherwise $VS$ reports that it cannot find $MPITest.exe$; I do not know why this $bug$ exists.<br><img src="https://note.youdao.com/yws/api/personal/file/WEBf9d84e95565cd0cf872f810f043ea6ed?method=download&shareKey=707ee3c6204b8e72dfaa8e511c794021"></p>
<h1 id="First-Allreduce-Program"><a href="#First-Allreduce-Program" class="headerlink" title="First Allreduce Program"></a>First Allreduce Program</h1><p>This section shows how to call MPI_Allreduce to evaluate $\int_{1}^{e}\ln{x}dx$ with the trapezoid rule.<br>$Allreduce$ is the most commonly used collective operation in $MPI$, and it is closely related to $Reduce$. Suppose there are $p$ processes, each holding a vector of $n$ elements; all $p$ processes send their vectors to a root process, which combines them into the reduced result (sum, maximum, minimum, and so on). A $Reduce$ leaves the result on the root only, while $Allreduce$ additionally broadcasts the root's result to everyone. Simply calling MPI_Allreduce in an application completes this routine; the function is defined as follows:</p>
<figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="type">int</span> <span class="title">MPI_Allreduce</span><span class="params">(</span></span></span><br><span class="line"><span class="params"><span class="function">            <span class="type">const</span> <span class="type">void</span> *sendbuf,      <span class="comment">//source data buffer</span></span></span></span><br><span class="line"><span class="params"><span class="function">            <span class="type">void</span> *recvbuf,            <span class="comment">//buffer for the reduced result</span></span></span></span><br><span class="line"><span class="params"><span class="function">            <span class="type">int</span> count,                <span class="comment">//number of elements</span></span></span></span><br><span class="line"><span class="params"><span class="function">            MPI_Datatype datatype,    <span class="comment">//datatype of the elements</span></span></span></span><br><span class="line"><span class="params"><span class="function">            MPI_Op op,                <span class="comment">//reduction operation</span></span></span></span><br><span class="line"><span class="params"><span class="function">            MPI_Comm comm)</span></span>;           <span class="comment">//communicator (the participating processes)</span></span><br></pre></td></tr></table></figure>
<p>The MPI_Reduce and MPI_Allreduce routines are illustrated below:</p>
<div style="text-align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEB635e35297acaeac7ad5f79fc97142fab?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13" width="70%" height="70%"/>
</div>

<div style="text-align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEB8453078f33cba2dcd2ba5b49d8b04daf?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13" width="70%" height="70%"/>
</div>

<p>The image sources and an introductory MPI_Allreduce tutorial are here: <a target="_blank" rel="noopener" href="https://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/">English version</a> and <a target="_blank" rel="noopener" href="https://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/zh_cn/">Chinese version</a> </p>
<p>The trapezoid-rule approach to $\int_{1}^{e}\ln{x}dx$ is to split the interval $[1,e]$ into 1024 segments, each corresponding to a narrow trapezoid of area $\ln{x_i}\cdot{dx}$, and then add these trapezoids up. $MPI$ can speed this up with multiple processes: each process computes the integral over part of the interval, and the partial results are then summed. In the program, $a$ and $b$ delimit the whole integration interval, while local_a and local_b delimit the subinterval each process is responsible for.</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="keyword">include</span><span class="string">&lt;stdio.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="keyword">include</span><span class="string">&lt;math.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="keyword">include</span><span class="string">&lt;mpi.h&gt;</span></span></span><br><span class="line"></span><br><span 
class="line"><span class="meta">#<span class="keyword">define</span>	   e	exp(1)</span></span><br><span class="line"></span><br><span class="line"><span class="type">int</span>  n = <span class="number">1024</span>, local_n;</span><br><span class="line"><span class="type">double</span>  a = <span class="number">1.0</span>, b = e, len;</span><br><span class="line"><span class="type">double</span>  local_a, local_b, sum = <span class="number">0.0</span>, ans, local_ans;</span><br><span class="line"></span><br><span class="line"><span class="type">double</span>  <span class="title function_">calc</span><span class="params">(<span class="type">double</span>  local_a, <span class="type">double</span>  local_b, <span class="type">int</span>  local_n)</span></span><br><span class="line">&#123;</span><br><span class="line">	<span class="type">double</span> sum = (<span class="built_in">log</span>(local_a) + <span class="built_in">log</span>(local_b)) / <span class="number">2</span>;</span><br><span class="line">	<span class="type">int</span> i;</span><br><span class="line">	<span class="type">double</span> xi;</span><br><span class="line">	<span class="keyword">for</span> (i = <span class="number">1</span>; i &lt; local_n; i++)</span><br><span class="line">	&#123;</span><br><span class="line">		xi = local_a + len * i;</span><br><span class="line">		sum += <span class="built_in">fabs</span>(<span class="built_in">log</span>(xi));</span><br><span class="line">	&#125;</span><br><span class="line">	<span class="keyword">return</span>  sum * len;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="type">int</span> <span class="title function_">main</span><span class="params">()</span></span><br><span class="line">&#123;</span><br><span class="line">	<span class="type">int</span>  comm_size;</span><br><span class="line">	<span class="type">int</span>  my_rank;</span><br><span class="line">	MPI_Init(<span 
class="literal">NULL</span>, <span class="literal">NULL</span>);</span><br><span class="line">	MPI_Comm_size(MPI_COMM_WORLD, &amp;comm_size);</span><br><span class="line">	MPI_Comm_rank(MPI_COMM_WORLD, &amp;my_rank);</span><br><span class="line"></span><br><span class="line">	len = (b - a) / n;		<span class="comment">//i.e., dx</span></span><br><span class="line">	local_a = a + my_rank * len * (n / comm_size);		<span class="comment">//local_a: start of the subinterval this process integrates</span></span><br><span class="line">	local_b = local_a + len * (n / comm_size);		<span class="comment">//local_b: end of the subinterval this process integrates</span></span><br><span class="line">	local_n = n / comm_size;</span><br><span class="line"></span><br><span class="line">	<span class="comment">//each process integrates over its own subinterval</span></span><br><span class="line">	local_ans = calc(local_a, local_b, local_n);	</span><br><span class="line">	<span class="comment">//sum each local_ans into the global result ans</span></span><br><span class="line">	MPI_Allreduce(&amp;local_ans, &amp;ans, <span class="number">1</span>, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);</span><br><span class="line"></span><br><span class="line">	<span class="built_in">printf</span>(<span class="string">&quot;My rank: %d, Result : %.5f\n&quot;</span>, my_rank, ans);</span><br><span class="line">	MPI_Finalize();</span><br><span class="line"></span><br><span class="line">	<span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>On Linux, compile with the math library linked in ($\verb+-lm+$):</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">mpicc allreduce.c -o allreduce -lm</span><br><span class="line">mpiexec -n <span class="number">4</span> ./allreduce</span><br></pre></td></tr></table></figure>
<p>This yields the result.</p>

      
    </div>
    <footer class="article-footer">
      <a data-url="https://chudod.gitee.io/myworks/2022/03/30/First_MPI_Program/" data-id="cl3kznqzv0006j0uk4e5u205v" data-title="First Allreduce Application" class="article-share-link">Share</a>
      
      
      
  <ul class="article-tag-list" itemprop="keywords"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/MPI/" rel="tag">MPI</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/" rel="tag">并行计算</a></li></ul>

    </footer>
  </div>
  
</article>



  
    <article id="post-MPI_Allreduce_Summary" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
  <div class="article-meta">
    <a href="/myworks/2022/03/27/MPI_Allreduce_Summary/" class="article-date">
  <time class="dt-published" datetime="2022-03-26T16:00:00.000Z" itemprop="datePublished">2022-03-27</time>
</a>
    
  <div class="article-category">
    <a class="article-category-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/">并行计算</a>►<a class="article-category-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/MPI%E9%9B%86%E5%90%88%E9%80%9A%E4%BF%A1%E7%AE%97%E6%B3%95/">MPI集合通信算法</a>
  </div>

  </div>
  <div class="article-inner">
    
    
      <header class="article-header">
        
  
    <h1 itemprop="name">
      <a class="p-name article-title" href="/myworks/2022/03/27/MPI_Allreduce_Summary/">MPI_Allreduce的前世今生</a>
    </h1>
  

      </header>
    
    <div class="e-content article-entry" itemprop="articleBody">
      
<h1 id="摘要"><a href="#摘要" class="headerlink" title="Abstract"></a>Abstract</h1><p>This article surveys the algorithm design and implementation of the $Allreduce$ operation in $MPI$: it organizes the development of these algorithms, traces them up to the latest research, and analyzes how $MPICH$, a canonical $MPI$ implementation, implements the $Allreduce$ routine.</p>
<h1 id="Allreduce介绍"><a href="#Allreduce介绍" class="headerlink" title="Introduction to Allreduce"></a>Introduction to Allreduce</h1><p>$Allreduce$ is the most heavily used collective communication operation in $MPI$. It is closely related to the $Reduce$ operation: suppose there are $p$ processes, each holding a vector of $n$ elements; all $p$ processes send their vectors to a root process, which gathers them and computes the reduced result (sum, max, min, and so on). $Reduce$ leaves the result on the root process only, whereas $Allreduce$ then broadcasts the root's result back out. Simply calling $\verb+MPI_Allreduce+$ in an application performs this entire routine; the function is declared as follows:</p>
<figure class="highlight cpp"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="function"><span class="type">int</span> <span class="title">MPI_Allreduce</span><span class="params">(</span></span></span><br><span class="line"><span class="params"><span class="function">            <span class="type">const</span> <span class="type">void</span> *sendbuf,      <span class="comment">//存放源数据</span></span></span></span><br><span class="line"><span class="params"><span class="function">            <span class="type">void</span> *recvbuf,            <span class="comment">//存放规约结果</span></span></span></span><br><span class="line"><span class="params"><span class="function">            <span class="type">int</span> count,                <span class="comment">//数据个数</span></span></span></span><br><span class="line"><span class="params"><span class="function">            MPI_Datatype datatype,    <span class="comment">//数据类型</span></span></span></span><br><span class="line"><span class="params"><span class="function">            MPI_Op op,                <span class="comment">//规约操作类型</span></span></span></span><br><span class="line"><span class="params"><span class="function">            MPI_Comm comm)</span></span>;           <span class="comment">//一组通信进程</span></span><br></pre></td></tr></table></figure>
<p>The $\verb+MPI_Reduce+$ and $\verb+MPI_Allreduce+$ routines are illustrated below:</p>
<div style="align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEB635e35297acaeac7ad5f79fc97142fab?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13" width="70%" height="70%"/>
</div>

<div style="align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEB8453078f33cba2dcd2ba5b49d8b04daf?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13" width="70%" height="70%"/>
</div>

<p>The source of the images, together with an introductory tutorial on $\verb+MPI_Allreduce+$, is here: <a target="_blank" rel="noopener" href="https://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/">English version</a> and <a target="_blank" rel="noopener" href="https://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/zh_cn/">Chinese version</a>  </p>
<p>$\verb+MPI_Allreduce+$ is used widely across parallel and distributed applications: scientific computing, big data, distributed machine learning, distributed deep learning (DNNs), and more, and <a target="_blank" rel="noopener" href="https://ieeexplore.ieee.org/document/8665758">prior work</a> has shown that $\verb+MPI_Allreduce+$ is the collective operation invoked most frequently and accounting for the most runtime.</p>
<p>Since the $MPI$ standard was introduced in $1994$, research on $\verb+MPI_Allreduce+$ has accumulated steadily, going back to the 1990s. Starting from a 2005 survey paper and tracing forward to the present, this article summarizes the fiery youth, the seasoned present, and the starry-eyed future of $\verb+MPI_Allreduce+$ algorithms.  </p>
<h1 id="经典数据结构与算法"><a href="#经典数据结构与算法" class="headerlink" title="Classic algorithms"></a>Classic Algorithms</h1><h2 id="0、评估模型：-T-x3D-alpha-n-beta"><a href="#0、评估模型：-T-x3D-alpha-n-beta" class="headerlink" title="0. The cost model: $T=\alpha+n\beta$"></a>0. The cost model: $T=\alpha+n\beta$</h2><p><a target="_blank" rel="noopener" href="https://journals.sagepub.com/doi/10.1177/1094342005051521">2005: Optimization of Collective Communication Operations in MPICH</a> is a classic paper on $MPI$ collective communication. It surveys the common collective operations and their algorithms: $\verb+Allgather+$, $\verb+Broadcast+$, $\verb+All-to-all+$, $\verb+Reduce-Scatter+$, $\verb+Reduce+$, and $\verb+Allreduce+$, and then discusses optimizations of $\verb+MPI_Reduce+$ and $\verb+MPI_Allreduce+$. (An elective course assignment of mine was to translate this paper in full: <a target="_blank" rel="noopener" href="https://note.youdao.com/ynoteshare/index.html?id=1b5cc29dea0aff44f92bdb69a2763e79&type=notebook&_time=1648432853246#/WEB26278ae64dfc1a52e0ece0dd615075f3">translation</a>.)<br>The performance model the paper adopts is $T=\alpha+n\beta$: the time to send one message between any two nodes, where $\alpha$ is the latency (startup time), independent of message size, $\beta$ is the per-byte transfer time, and $n$ is the message size in bytes; for reductions, $\gamma$ additionally denotes the per-byte cost of the reduction computation. Since $MPI$ targets distributed-memory systems, start from the $LogP$ model ($L$: latency, $o$: overhead, $g$: gap, $P$: processors), in which transmitting a message of $n$ bytes takes<br>$$T=L+2o+(n-1)b,\ \text{where}\ b=\min(o, g)$$<br>that is, the wire latency $L$, plus the processing overhead $2o$ at the two endpoints, plus the transfer time of the remaining bytes; note that the $overhead$ is the per-message processing cost and the $gap$ is the interval between two consecutive bytes. Rearranging gives<br>$$T=(L+2o-b)+nb=\alpha+n\beta$$<br>which is exactly the cost model used in this article and in many other papers and textbooks. We call $\alpha$ the latency term and $n\beta$ the bandwidth term, and use this formula to evaluate collective communication algorithms formally.<br>The rest of this section walks through the collective algorithms presented in the paper.  </p>
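<p>To make the model concrete, here is a minimal sketch of these cost formulas as code; the function names and any constants used with them are mine and purely illustrative:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Point-to-point message time under the alpha-beta model: T = alpha + n*beta. */
double ptp_time(double alpha, double beta, size_t n) {
    return alpha + (double)n * beta;
}

/* Ring allgather cost from the text: (p-1)*alpha + ((p-1)/p)*n*beta. */
double ring_allgather_time(double alpha, double beta, size_t n, int p) {
    return (double)(p - 1) * alpha + ((double)(p - 1) / (double)p) * (double)n * beta;
}
```

<p>Plugging in hypothetical parameters (say $\alpha=1$, $\beta=0.5$) lets one compare algorithms before ever touching a cluster.</p>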
<h2 id="1、-Allgather-操作及算法"><a href="#1、-Allgather-操作及算法" class="headerlink" title="1. The $Allgather$ operation and its algorithms"></a>1. The $Allgather$ operation and its algorithms</h2><p>The $Allgather$ operation is illustrated below:  </p>
<div style="align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEB5c825f77e456600e07ef80d2b87f1e17?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13"/>
</div>  

<p>For the $\verb+MPI_Allgather+$ function itself, see <a target="_blank" rel="noopener" href="https://mpitutorial.com/tutorials/mpi-scatter-gather-and-allgather/">MPI_Allgather</a>. </p>
<h3 id="Ring-Algorithm"><a href="#Ring-Algorithm" class="headerlink" title="Ring Algorithm"></a>Ring Algorithm</h3><p>$MPICH$ originally implemented the $Allgather$ operation with the $Ring$ algorithm.</p>
<div style="align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEBb157d10d3be479fb109ae83144da86e9?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13" width="40%" height="40%"/>
</div> 

<p>每一个进程$i$发送本地数据给进程$(i+1)%p$，并且接受来自$(i-1)%p$的数据（环绕方式）。以后每一步，进程$i$都向进程$(i+1)%p$发送上一步接收的来自$(i-1)%p$号进程的数据。假设有$p$个进程，该算法总共需要$p-1$步来完成。用$n$表示收集的数据总量，每一步每个进程都发送$\frac{n}{p}$的数据，因此算法的时间消耗可以计算为：<br>$$T_{Ring}&#x3D;(p-1)\alpha+\frac{p-1}{p}n\beta$$    </p>
<h3 id="Recursive-Doubling"><a href="#Recursive-Doubling" class="headerlink" title="Recursive Doubling"></a>Recursive Doubling</h3><p>The recursive doubling algorithm proceeds as shown below:</p>
<div style="align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEB0cedb39776a1eb420c04698b214a371e?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13" width="50%" height="50%"/>
</div> 

<p>In the first step, processes at distance 1 from each other exchange their data, $\frac{n}{p}$ each; in the second step, processes at distance 2 exchange their own data together with the data received from their neighbor in the previous step, $\frac{2n}{p}$; in the third step, processes at distance 4 exchange everything collected in the first two steps, $\frac{4n}{p}$; and so on. All processes obtain all the data in $\lg p$ steps, at a cost of<br>$$T_{rec\_dbl}=\lg p\cdot\alpha+\frac{p-1}{p}n\beta$$<br>The bandwidth term is the same as the ring algorithm's, because<br>$$\frac{n}{p}\beta+\frac{2n}{p}\beta+\dots+\frac{2^{\lg p-1}n}{p}\beta=\frac{p-1}{p}n\beta$$<br>The underlying logic is that any process must receive a total of $(p-1)\cdot\frac{n}{p}$ data from the other $p-1$ processes, so the bandwidth term cannot be reduced further, while the latency term can still be reduced by a better algorithm. Recursive doubling handles power-of-two process counts cleanly, but non-power-of-two counts are harder for it. </p>
<h3 id="Bruck-Algorithm"><a href="#Bruck-Algorithm" class="headerlink" title="Bruck Algorithm"></a>Bruck Algorithm</h3><p>The $Bruck$ algorithm handles non-power-of-two process counts well; it runs in $\lceil{\lg p}\rceil$ steps, illustrated below:</p>
<div style="align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEBe58d78d1a74b305f1da143695784f911?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13"/>
</div> 

<p>Each process has a buffer of size $n$; at the start, each process copies its local data to the top of its buffer. In step $k$, process $i$ sends all of its local data to the target process $(i-2^k)\%p$ and appends the data received (from process $(i+2^k)\%p$) to the end of its local data, for $\lfloor{\lg p}\rfloor$ such steps in total. If the process count is not a power of two, one extra step is needed: each process sends the $p-2^{\lfloor{\lg p}\rfloor}$ blocks at the head of its buffer to process $(i-2^k)\%p$ (a partial send, whereas the earlier steps sent all local data) and appends the received blocks to the end of its buffer.<br>At this point every process holds all the data, but the data is not in the correct order in the buffer: on process $i$, all blocks are shifted upward by $i$ positions. Simply rotating all blocks downward by $i$ positions puts them back in place. The cost of the algorithm is<br>$$T_{Bruck}=\lceil{\lg p}\rceil\cdot\alpha+\frac{p-1}{p}n\beta$$<br>The algorithm selection strategy for the $Allgather$ operation is:</p>
<ul>
<li>For short and medium messages with a power-of-two process count, use the $Recursive\ Doubling$ algorithm;</li>
<li>For short messages with a non-power-of-two process count, use the $Bruck$ algorithm;</li>
<li>For long messages regardless of the process count, and for medium messages with a non-power-of-two count, use the $Ring$ algorithm.</li>
</ul>
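<p>The $Bruck$ schedule, including the final downward rotation, can likewise be checked with a small $MPI$-free simulation; the names and the 6-process size are mine, with 6 chosen deliberately as a non-power-of-two:</p>

```c
#include <assert.h>

#define BRUCK_P 6  /* deliberately not a power of two */

/* Simulate the Bruck allgather schedule, including the final downward
   rotation. Fills out[i][b] with the block index each process ends up
   holding in slot b, and returns the number of communication steps. */
int bruck_allgather(int out[BRUCK_P][BRUCK_P]) {
    int buf[BRUCK_P][BRUCK_P], len = 1, steps = 0;
    for (int i = 0; i < BRUCK_P; i++) buf[i][0] = i;
    for (int d = 1; d < BRUCK_P; d *= 2) {
        /* full sends while the buffer still fits; the last (partial) step
           sends only the p - len blocks at the head of the buffer */
        int cnt = (len < BRUCK_P - len) ? len : BRUCK_P - len;
        int recv[BRUCK_P][BRUCK_P];
        for (int i = 0; i < BRUCK_P; i++)          /* i receives from (i+d)%p */
            for (int j = 0; j < cnt; j++)
                recv[i][j] = buf[(i + d) % BRUCK_P][j];
        for (int i = 0; i < BRUCK_P; i++)
            for (int j = 0; j < cnt; j++)
                buf[i][len + j] = recv[i][j];      /* append to local buffer */
        len += cnt;
        steps++;
    }
    /* on process i, slot j holds block (i+j)%p; rotate down by i to fix */
    for (int i = 0; i < BRUCK_P; i++)
        for (int j = 0; j < BRUCK_P; j++)
            out[i][(i + j) % BRUCK_P] = buf[i][j];
    return steps;
}
```

<p>With 6 processes the simulation finishes in $\lceil{\lg 6}\rceil=3$ steps, and after the rotation slot $b$ of every process holds block $b$.</p>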
<h2 id="2、-Broadcast-操作及算法"><a href="#2、-Broadcast-操作及算法" class="headerlink" title="2. The $Broadcast$ operation and its algorithms"></a>2. The $Broadcast$ operation and its algorithms</h2><p>Broadcast sends the root process's data to all processes, via the $\verb+MPI_Bcast+$ function; see <a target="_blank" rel="noopener" href="https://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/">MPI_Bcast</a></p>
<div style="align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEBa1d3f66081e6d3666838b6ff37186727?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13"/>
</div> 

<h3 id="Bionomial-Tree"><a href="#Bionomial-Tree" class="headerlink" title="Binomial Tree"></a>Binomial Tree</h3><p>$MPICH$'s original broadcast used the binomial tree algorithm. In the first step, the root process $root$ sends its data to the target process $(root+\frac{p}{2})\%p$; then $(root+\frac{p}{2})\%p$ and the root each become the root of their own subtree, and the algorithm recurses. The algorithm runs for $\lceil{\lg{p}}\rceil$ steps, and in every step each sending process transmits the full $n$ of data, so the cost is<br>$$T_{tree}=\lceil{\lg{p}}\rceil\cdot(\alpha+n\beta)$$</p>
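<p>A quick way to see the $\lceil{\lg{p}}\rceil$ step count is to simulate the tree. The sketch below uses the equivalent distance-doubling formulation with rank 0 as root; the names and the 13-process size are illustrative (13 matches the non-power-of-two example later in the text):</p>

```c
#include <assert.h>

#define BCAST_P 13  /* illustrative, non-power-of-two process count */

/* Simulate a binomial-tree broadcast from rank 0: in the round with
   distance d, every rank below d (all of which already hold the data)
   forwards the full message to rank+d. Returns the number of rounds,
   or -1 if some rank never receives the data. */
int binomial_bcast_rounds(void) {
    int have[BCAST_P] = {0};
    have[0] = 1;                      /* root holds the data */
    int rounds = 0;
    for (int d = 1; d < BCAST_P; d *= 2) {
        for (int i = 0; i < d && i < BCAST_P; i++)
            if (have[i] && i + d < BCAST_P)
                have[i + d] = 1;      /* i sends the full message to i+d */
        rounds++;
    }
    for (int i = 0; i < BCAST_P; i++)
        if (!have[i]) return -1;
    return rounds;
}
```

<p>With 13 processes the broadcast completes in $\lceil{\lg 13}\rceil=4$ rounds.</p>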
<h3 id="Scatter-Allgather"><a href="#Scatter-Allgather" class="headerlink" title="Scatter + Allgather"></a>Scatter + Allgather</h3><p>This combined algorithm, also called the $Van\ de\ Geijn$ algorithm, builds the $Broadcast$ operation out of $Scatter$ and $Allgather$. The $Scatter$ operation is contrasted with $Broadcast$ below:</p>
<div style="align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEBc635d7af2913d2a188bcd426165b25b5?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13"/>
</div> 

<p>In this algorithm, the data to be broadcast is first split into pieces and scattered across the processes; the scattered pieces are then collected onto every process, i.e., an $\verb+MPI_Allgather+$ is performed. The $Scatter$ step uses the binomial tree algorithm, costing<br>$$T_{Scatter}=\lg{p}\cdot\alpha+\frac{p-1}{p}n\beta$$<br>which equals the cost of the recursive doubling $Allgather$; looking closely, the two are inverses of each other. The $Allgather$ step can use recursive doubling or the ring algorithm, and the total time is the sum of the two phases. The broadcast costs of the binomial tree and of $Scatter+Allgather$ therefore compare as<br>$$\begin{cases}T_{tree}=\lceil{\lg{p}}\rceil\cdot(\alpha+n\beta)\\T_{Scatter+Allgather}=(\lg{p}+p-1)\alpha+2\frac{p-1}{p}n\beta\end{cases}$$<br>Comparing the two expressions, we can easily conclude:</p>
<ul>
<li>When the message is small (small $n$) or there are few processes (fewer than 8), use the binomial tree algorithm;</li>
<li>When the message is large or there are many processes, use the $Scatter$+$Allgather$ combination.</li>
</ul>
<h2 id="3、-Reduce-Scatter-操作及其算法"><a href="#3、-Reduce-Scatter-操作及其算法" class="headerlink" title="3. The $Reduce-Scatter$ operation and its algorithms"></a>3. The $Reduce-Scatter$ operation and its algorithms</h2><p>The $Reduce-Scatter$ operation (all-to-all reduction) is a variant of the data reduction operation $Reduce$: the result of $Reduce$ lives on the root process, whereas $Reduce-Scatter$ scatters the result to all processes.</p>
<h3 id="二项树Reduce-线性Scatter"><a href="#二项树Reduce-线性Scatter" class="headerlink" title="Binomial-tree Reduce + linear Scatter"></a>Binomial-tree Reduce + linear Scatter</h3><p>In the old $MPICH$ algorithm, $Reduce-Scatter$ first reduces the data of all processes onto process $0$ with a binomial tree, then distributes it with a linear scatter. The binomial-tree reduction takes<br>$$\lg{p}\cdot(\alpha+n\beta+n\gamma)$$<br>and the linear scatter takes<br>$$(p-1)\alpha+(p-1)\cdot\frac{n}{p}\beta$$<br>for a total of<br>$$T_{old}=(\lg{p}+p-1)\alpha+(\lg{p}+\frac{p-1}{p})n\beta+\lg{p}\cdot{n}\gamma$$</p>
<h3 id="Recursive-Halving"><a href="#Recursive-Halving" class="headerlink" title="Recursive Halving"></a>Recursive Halving</h3><p>The recursive halving algorithm is the inverse of the recursive doubling algorithm used for $Allgather$.</p>
<div style="align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEB45d2ca1b81056825d285e44a1ec85910?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13"/>
</div> 

<p>In step 1, the processes split into 2 subsets, and each process exchanges data with the process $\frac{p}{2}$ away from it: each process sends the data needed by all processes in the other half and receives the data needed by its own half, then performs the reduction on what it received. In step 2, each process exchanges data with the process $\frac{p}{4}$ away. The recursion continues like this, with the communicated data also halving each step, for $\lg{p}$ steps. The cost of the algorithm is<br>$$T_{rec\_halv}=\lg{p}\cdot\alpha+\frac{p-1}{p}(n\beta+n\gamma)$$<br>The algorithm is correct only if the reduction operation is commutative. Commutative reductions are the more common case, since many of the reduction operations $\verb+MPI+$ defines, such as $\verb+MPI_SUM+$ and $\verb+MPI_MAX+$, are commutative. If the process count is not a power of two, we first reduce it to one: the first $x$ even-ranked processes send their data to the nearest odd-ranked process ($rank+1$), chosen so that $p-x$ is a power of $2$. The odd-ranked processes reduce the data they collected; then these odd-ranked processes and the remaining $p-2x$ processes ($p-x$ in total) participate in the recursive halving algorithm to compute their results; finally, the first $x$ odd-ranked processes send the results back to their left neighbors. The cost then becomes<br>$$T_{rec\_halv}=(\lfloor{\lg{p}}\rfloor+2)\alpha+2n\beta+n(1+\frac{p-1}{p})\gamma$$</p>
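<p>For a power-of-two process count, the halving schedule can be sketched as an $MPI$-free simulation: each simulated process keeps the half-range containing its own block and adds in its partner's partial sums. All names are mine, and a sum stands in for any commutative reduction:</p>

```c
#include <assert.h>
#include <string.h>

#define RH_P 8  /* power-of-two process count */

/* Simulate recursive halving for reduce-scatter (sum): each process starts
   with a full vector of RH_P blocks and ends owning the fully reduced
   block matching its rank, written into out[rank]. */
void recursive_halving(int data[RH_P][RH_P], int out[RH_P]) {
    int acc[RH_P][RH_P], lo[RH_P], hi[RH_P];
    memcpy(acc, data, sizeof acc);
    for (int i = 0; i < RH_P; i++) { lo[i] = 0; hi[i] = RH_P; }
    for (int d = RH_P / 2; d >= 1; d /= 2) {
        int nlo[RH_P], nhi[RH_P], add[RH_P][RH_P];
        for (int i = 0; i < RH_P; i++) {
            int partner = i ^ d;              /* exchange with the process d away */
            int mid = (lo[i] + hi[i]) / 2;
            /* keep the half that contains our own block index; the partner
               sends us its partial sums for exactly that half */
            nlo[i] = (i < partner) ? lo[i] : mid;
            nhi[i] = (i < partner) ? mid : hi[i];
            for (int b = nlo[i]; b < nhi[i]; b++) add[i][b] = acc[partner][b];
        }
        for (int i = 0; i < RH_P; i++) {      /* apply all exchanges at once */
            for (int b = nlo[i]; b < nhi[i]; b++) acc[i][b] += add[i][b];
            lo[i] = nlo[i]; hi[i] = nhi[i];
        }
    }
    for (int i = 0; i < RH_P; i++) out[i] = acc[i][i];
}
```

<p>After $\lg p$ rounds each simulated rank holds exactly the globally reduced block for its own index, and the communicated range halves every round, matching the $\frac{p-1}{p}n$ bandwidth total.</p>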
<h3 id="Pairwise-Exchange"><a href="#Pairwise-Exchange" class="headerlink" title="Pairwise Exchange"></a>Pairwise Exchange</h3><p>The pairwise exchange algorithm targets reductions that are not commutative; its idea resembles the recursive doubling $Allgather$. In step $1$, each pair of neighboring processes exchanges data; in step $2$, processes at distance $2$ exchange; in step $3$, processes at distance $4$ exchange, and so on. Compared with $Allgather$, however, more data is exchanged. In the first step, each process exchanges all data except what it needs itself ($n-\frac{n}{p}$ of data): for example, process 0 sends blocks $1{\sim}(p-1)$ (everything except block 0) to process $1$, and process 1 sends blocks $0$ and $2{\sim}(p-1)$ (everything except block 1) to process $0$. In the second step, processes exchange everything except what they and their previous partner already hold ($n-\frac{2n}{p}$); the third step exchanges $n-\frac{4n}{p}$; and so on. The cost is<br>$$T_{short}=\lg{p}\cdot\alpha+(\lg{p}-\frac{p-1}{p})(n\beta+n\gamma)$$<br>This variant suits messages smaller than 256B. For long messages ($\geqslant256KB$ for commutative operations, $\geqslant256B$ for non-commutative ones), we use a pairwise exchange that runs $p-1$ steps: in step $i$, each process sends data to $(rank+i)\%p$, receives data from $(rank-i)\%p$, and performs a local reduction. Only the $\frac{n}{p}$ of data used to scatter the result is exchanged, i.e., each process sends only the part the destination needs. The cost is<br>$$T_{long}=(p-1)\alpha+\frac{p-1}{p}(n\beta+n\gamma)$$<br>Tips:</p>
<ul>
<li>Commutative operations: the data reduction operations defined by MPI include $\verb+sum+$, $\verb+min+$, $\verb+max+$, $\verb+MinLoc+$, $\verb+MaxLoc+$, $\verb+(bitwise)OR+$, $\verb+AND+$, $\verb+XOR+$, and so on; some of them are commutative and some are not.</li>
<li>Associative operations: floating-point addition and multiplication are commutative but not associative, since in general $(a+b)+c\neq a+(b+c)$; for example, $10^{20}-(10^{20}+\epsilon)=0$ whereas $10^{20}-10^{20}-\epsilon=-\epsilon$.</li>
<li>$Recursive\ Halving$ is suited to commutative operations, while $Recursive\ Doubling$ only requires associative operations.</li>
</ul>
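<p>The non-associativity point is easy to demonstrate in a few lines of C:</p>

```c
#include <assert.h>

/* Floating-point addition is commutative but not associative, so the
   grouping a reduction algorithm uses can change the result bit-for-bit. */
int fp_addition_assoc_holds(void) {
    double a = 1e20, b = -1e20, c = 1.0;
    double left  = (a + b) + c;   /* 0.0 + 1.0 = 1.0 */
    double right = a + (b + c);   /* c is absorbed: -1e20 + 1.0 == -1e20, so 0.0 */
    return left == right;         /* 0: the two groupings disagree */
}
```

<p>This is why two runs of the same reduction with different communication schedules can produce (slightly) different floating-point answers.</p>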
<p>The $\verb+MPI_Reduce_scatter+$ strategy:</p>
<ul>
<li>If the operation is commutative: for messages $&lt;{256KB}$ use recursive halving, and for $\geqslant{256KB}$ use the $(p-1)$-step pairwise exchange;</li>
<li>If the operation is not commutative: for messages $&lt;256B$ use the $\lg{p}$-step pairwise exchange, and for $\geqslant{256B}$ use the $(p-1)$-step pairwise exchange.</li>
</ul>
<h2 id="4、-Reduce-操作及其算法"><a href="#4、-Reduce-操作及其算法" class="headerlink" title="4. The $Reduce$ operation and its algorithms"></a>4. The $Reduce$ operation and its algorithms</h2><h3 id="Bionomial-Tree-1"><a href="#Bionomial-Tree-1" class="headerlink" title="Binomial Tree"></a>Binomial Tree</h3><p>The old $MPICH$ algorithm uses a binomial tree, running $\lceil{\lg{p}}\rceil$ steps; each step moves one process's entire $n$ of data and reduces it, so the cost is<br>$$T_{tree}=\lceil{\lg{p}}\rceil(\alpha+n\beta+n\gamma)$$</p>
<h3 id="Reduce-scatter-Gather组合算法"><a href="#Reduce-scatter-Gather组合算法" class="headerlink" title="The Reduce_scatter+Gather combination"></a>The Reduce_scatter+Gather combination</h3><p>This algorithm composes the $Reduce$ operation out of $Reduce-scatter$ and $Gather$; it is also called the $Rabenseifner$ algorithm. Recall that the $Scatter+Allgather$ broadcast successfully brought the binomial tree's $\lg{p}\cdot{n\beta}$ bandwidth term down to the order of $2{n\beta}$. $Reduce$ is roughly the inverse of broadcast, so the same idea applies: the $Reduce-scatter$+$Gather$ composition likewise brings the bandwidth term down from $\lg{p}\cdot{n\beta}$ to the order of $2{n\beta}$. The cost is the sum of the $Reduce-scatter$ (recursive halving) and $Gather$ (binomial tree) phases:<br>$$T_{raben}=2\lg{p}\cdot\alpha+\frac{p-1}{p}(2n\beta+n\gamma)$$<br>Strategy:</p>
<ul>
<li>For small messages ($&lt;2KB$), use the binomial tree algorithm;</li>
<li>For large messages ($\geqslant2KB$), use the $Reduce-scatter$+$Gather$ algorithm.</li>
</ul>
<p>To sum up: a pattern should be visible by now. For large messages (large $n$) we want to minimize the bandwidth term, i.e., the coefficient in front of $n\beta$, and a somewhat larger latency (startup) term $\alpha$ matters little; for small messages (small $n$) we want to minimize the latency term instead. This is the core idea behind choosing different algorithms for the $Reduce$ and $Broadcast$ operations.</p>
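<p>The tradeoff becomes concrete if we code the two $Reduce$ cost formulas and compare them at a small and a large $n$; the $\alpha$, $\beta$, $\gamma$, and $p$ values used in the checks are illustrative, not measurements:</p>

```c
#include <assert.h>

/* Binomial-tree Reduce cost: lgp * (alpha + n*beta + n*gamma). */
double reduce_tree_cost(double a, double b, double g, double n, double lgp) {
    return lgp * (a + n * b + n * g);
}

/* Rabenseifner (Reduce-scatter + Gather) cost:
   2*lgp*alpha + ((p-1)/p) * (2n*beta + n*gamma). */
double reduce_raben_cost(double a, double b, double g, double n, double p, double lgp) {
    return 2.0 * lgp * a + ((p - 1.0) / p) * (2.0 * n * b + n * g);
}
```

<p>For instance, with $\alpha=100$, $\beta=0.1$, $\gamma=0.05$, and $p=8$, the binomial tree wins at $n=16$ while the combined algorithm wins at $n=100000$, matching the strategy above.</p>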
<h3 id="Ring-Algorithm-1"><a href="#Ring-Algorithm-1" class="headerlink" title="Ring Algorithm"></a>Ring Algorithm</h3><p>Similar to the $Ring\ Allreduce$ introduced below, this uses the $Reduce-scatter$+$Gather$ approach, but the $Reduce-scatter$ sends each target process only its portion of the data ($\frac{n}{p}$), and the $Gather$ phase uses the ring algorithm.</p>
<h2 id="5、-Allreduce-操作及其算法"><a href="#5、-Allreduce-操作及其算法" class="headerlink" title="5. The $Allreduce$ operation and its algorithms"></a>5. The $Allreduce$ operation and its algorithms</h2><h3 id="Reduce-Broadcast"><a href="#Reduce-Broadcast" class="headerlink" title="Reduce+Broadcast"></a>Reduce+Broadcast</h3><p>The old $MPICH$ algorithm first $Reduce$s the result onto the root process and then $Broadcast$s the root's result to all processes.</p>
<h3 id="Recursive-Doubling-1"><a href="#Recursive-Doubling-1" class="headerlink" title="Recursive Doubling"></a>Recursive Doubling</h3><p>Recursive doubling for $Allreduce$ is very similar to recursive doubling for $Allgather$, except that every step also performs the reduction and the amount of data exchanged differs: each pairwise exchange moves the full $n$ of data. The cost is therefore<br>$$T_{rec\_dbl}=\lg{p}(\alpha+n\beta+n\gamma)$$</p>
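<p>Because each step is a full exchange plus a reduction, the pattern is short to simulate; the names and the 8-process size are mine, with a sum standing in for the reduction operation:</p>

```c
#include <assert.h>

#define RD_P 8  /* power-of-two process count */

/* Simulate recursive doubling allreduce (sum) on one value per process:
   in the round with distance d = 1, 2, 4, ... each rank exchanges its
   running sum with rank^d and adds. Returns the value every rank ends up
   holding, or -1 if the ranks disagree. */
double recdbl_allreduce(void) {
    double val[RD_P], total = 0.0;
    for (int i = 0; i < RD_P; i++) { val[i] = i + 1; total += i + 1; }
    for (int d = 1; d < RD_P; d *= 2) {
        double next[RD_P];
        for (int i = 0; i < RD_P; i++)
            next[i] = val[i] + val[i ^ d];   /* pairwise exchange + reduce */
        for (int i = 0; i < RD_P; i++) val[i] = next[i];
    }
    for (int i = 0; i < RD_P; i++)
        if (val[i] != total) return -1.0;
    return val[0];
}
```

<p>After $\lg 8=3$ rounds every simulated rank holds the global sum $1+2+\dots+8=36$.</p>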
<h3 id="Reduce-scatter-Allgather"><a href="#Reduce-scatter-Allgather" class="headerlink" title="Reduce_scatter+Allgather"></a>Reduce_scatter+Allgather</h3><p>This is also called the $Rabenseifner$ algorithm. Recall that for the $Reduce$ operation we used the $Reduce-scatter$+$Gather$ combination; here the second phase replaces $Gather$ with an $Allgather$, giving $Reduce-scatter$+$Allgather$. The total cost is<br>$$T_{raben}=2\lg{p}\cdot\alpha+\frac{p-1}{p}(2n\beta+n\gamma)$$<br>Both the $Reduce-scatter$+$Gather$ algorithm for $Reduce$ and the $Reduce-scatter$+$Allgather$ algorithm here need extra handling when the process count is not a power of two. We remove $r=p-p'$ surplus processes to bring the count down to the nearest power of two $(p'=2^{\lfloor{\lg{p}}\rfloor})$. Among the first $2r$ processes (ranks $0$ to $2r-1$), every even-ranked process sends the second half of its input vector to its right neighbor ($rank+1$), and every odd-ranked process sends the first half of its input vector to its left neighbor ($rank-1$). The even processes then reduce the first halves of the vectors and the odd processes reduce the second halves, after which each odd process sends its reduced half to its left neighbor. Once this step completes, every even-ranked process among the first $2r$ holds the reduction of itself and its right neighbor, and the odd-ranked processes take no further part in the algorithm; the process count is thereby reduced to a power of two: the first $r$ even processes plus the last $p-2r$ processes are renumbered $0{\sim}(p'-1)$, a power of two. These processes then run the $Reduce-scatter$+$Allgather$ algorithm. Finally, the reduced result must also be sent to the $r$ processes removed in the first step; for a $Reduce$ whose root was eliminated in that first step, the root swaps roles with its neighbor in the first step after the $Reduce-scatter$, which adds no extra cost.<br>The figure below shows the $Allreduce$ algorithm with 13 processes. The input vector and the result are divided into 8 parts ($A,B,…,H$), since $8$ is the largest power of $2$ not exceeding $13$, written $A{-}H_{rank}$. The first $2r$ processes ($r=13-8=5$, $2r=10$) first perform pairwise $Allreduce$s; then the first $r$ even-ranked processes (ranks $0$, $2$, $4$, $6$, $8$) and the remaining $3$ ($p-2r=3$) processes (ranks $10$, $11$, $12$) execute $Reduce-scatter+Allgather$; finally, the first $r$ even-ranked processes (ranks $0$, $2$, $4$, $6$, $8$) send the result to the first $r$ odd-ranked processes (ranks $1$, $3$, $5$, $7$, $9$).<br><img src="https://note.youdao.com/yws/api/personal/file/WEBca42cd1c5c39305518f9b87e91cbcdb4?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13"></p>
<h3 id="Bionary-Block-Algorithm"><a href="#Bionary-Block-Algorithm" class="headerlink" title="Binary Block Algorithm"></a>Binary Block Algorithm</h3><p>The binary block algorithm mitigates the load imbalance that $Reduce-scatter+Allgather$ suffers when the process count is not a power of two. As in the figure below, the processes are initially partitioned into blocks whose process-subset sizes are powers of two. Each block runs a $Reduce-scatter$ internally. Then, starting from the smallest block, each block's data is divided into segments and fed as input into the next larger block, which reduces the collected data: for example, block $\boxed{2^0}$ serves as input to block $\boxed{2^2}$ and the two blocks are reduced (note that block $\boxed{2^0}$'s 4 data copies are reduced with block $\boxed{2^2}$), and then block $\boxed{2^2}$ serves as input to block $\boxed{2^3}$ for another block-level reduction. Small blocks cause load imbalance, and the largest difference between two consecutive blocks determines how severe it is. Define $\delta_{expo,max}$ as the largest difference between the exponents of two consecutive block sizes (each a power of two): for example, $100=2^6+2^5+2^2$ gives $\delta_{expo,max}=max(6-5,5-2)=3$. The smaller $\delta_{expo,max}$ is, the better the algorithm performs.<br>The second phase of the algorithm is the $Allgather$: each larger block must send data back to the smaller ones, as shown in the figure.</p>
<p><img src="https://note.youdao.com/yws/api/personal/file/WEB493ba0c9657ed9064532a87b08ff8ece?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13"></p>
<h3 id="Ring-Algorithm-2"><a href="#Ring-Algorithm-2" class="headerlink" title="Ring Algorithm"></a>Ring Algorithm</h3><p>The ring algorithm is really a variant of $Reduce-scatter+Allgather$: in the $Reduce-scatter$ phase, each process sends each portion of the data directly toward its destination node, and the $Allgather$ phase is performed with the ring algorithm.<br>Borrowing the example from $\verb+OpenMPI+$ to explain $Ring\ Allreduce$:<br>suppose there are 5 processes; each process's input is split into 5 parts, and a $Computation\ stage$ (the $Reduce-scatter$) runs first, followed by a $Distribution\ phase$ (the ring $Allgather$).</p>
<div style="align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEBf2832a760d9e4937101fa93ae4410649?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13" width="70%" height="70%"/>
</div> 

<div style="align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEB43f5e95a1868d89cbd317169ea87df53?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13" width="70%" height="70%"/>
</div> 

<div style="align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEB169bdb1ca360203ba6ff2fdee5ddae24?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13" width="70%" height="70%"/>
</div> 

<p>The execution time of this algorithm is<br>$$T_{ring}=2(p-1)\alpha+\frac{p-1}{p}(2n\beta+n\gamma)$$</p>
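<p>The two ring phases can be checked end-to-end with a small simulation: $p-1$ reduce-scatter steps followed by $p-1$ allgather steps. All names below are mine, with 5 processes chosen to match the example above:</p>

```c
#include <assert.h>

#define RA_P 5  /* matches the 5-process example above */

/* Simulate ring allreduce (sum): RA_P-1 reduce-scatter steps in which each
   rank passes one chunk right and the receiver adds it in, then RA_P-1
   allgather steps circulating the finished chunks. Rank r contributes the
   value r+1 to every chunk. Returns 1 if every rank ends with the full sum
   in every chunk. */
int ring_allreduce_ok(void) {
    int buf[RA_P][RA_P], expect = 0;
    for (int r = 0; r < RA_P; r++) {
        expect += r + 1;
        for (int c = 0; c < RA_P; c++) buf[r][c] = r + 1;
    }
    for (int s = 0; s < RA_P - 1; s++) {        /* reduce-scatter phase */
        int chunk[RA_P], val[RA_P];
        for (int r = 0; r < RA_P; r++) {
            chunk[r] = ((r - s) % RA_P + RA_P) % RA_P;  /* chunk r sends */
            val[r] = buf[r][chunk[r]];
        }
        for (int r = 0; r < RA_P; r++) {
            int left = (r - 1 + RA_P) % RA_P;
            buf[r][chunk[left]] += val[left];   /* receive and reduce */
        }
    }
    for (int s = 0; s < RA_P - 1; s++) {        /* allgather phase */
        int chunk[RA_P], val[RA_P];
        for (int r = 0; r < RA_P; r++) {
            chunk[r] = ((r + 1 - s) % RA_P + RA_P) % RA_P;
            val[r] = buf[r][chunk[r]];
        }
        for (int r = 0; r < RA_P; r++) {
            int left = (r - 1 + RA_P) % RA_P;
            buf[r][chunk[left]] = val[left];    /* receive and overwrite */
        }
    }
    for (int r = 0; r < RA_P; r++)
        for (int c = 0; c < RA_P; c++)
            if (buf[r][c] != expect) return 0;
    return 1;
}
```

<p>After the reduce-scatter phase, rank $r$ owns the fully reduced chunk $(r+1)\%p$; the allgather phase then circulates these finished chunks, giving $2(p-1)$ steps total, which is where the $2(p-1)\alpha$ latency term comes from.</p>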
<h2 id="Allreduce选择最佳算法"><a href="#Allreduce选择最佳算法" class="headerlink" title="Choosing the best Allreduce algorithm"></a>Choosing the best Allreduce algorithm</h2><p>Among the $5$ $Allreduce$ algorithms introduced above, the choice depends on the process count and the message size. The figure below shows the best algorithm for each combination of process count and message size (for an $\verb+MPI_SUM+$ reduction over $\verb+MPI_DOUBLE+$ data); $halving+doubling$ denotes the $Reduce-scatter+Allgather$ algorithm.</p>
<div style="align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEBcc34716f4289d3c74ac7e0bd65786af4?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13" width="70%" height="70%"/>
</div> 

<p>The figure below shows the bandwidth achieved by the different algorithms when performing an $\verb+MPI_SUM+$ reduction over $\verb+MPI_DOUBLE+$ data with a message size of $32KB$.</p>
<div style="align: center">
<img src="https://note.youdao.com/yws/api/personal/file/WEBa4ff21d874a476bc6a4110496d8d7bbc?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13" width="70%" height="70%"/>
</div> 

<p>Strategy:</p>
<ul>
<li>For short messages, use the $Recursive\ Doubling$ algorithm;</li>
<li>For long messages, perform $Reduce-scatter$ (the $Recursive\ Halving$ algorithm) followed by $Allgather$ (the $Recursive\ Doubling$ algorithm).</li>
</ul>
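<p>This selection rule can be sketched as a tiny predicate. The 2048-byte cutoff here is illustrative, not necessarily $MPICH$'s actual threshold, and the non-commutative fallback is my reading of the earlier observation that recursive halving requires commutativity:</p>

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the allreduce selection rule described above: short messages
   (and non-commutative operations, which the reduce-scatter + allgather
   path cannot handle) use recursive doubling; long commutative reductions
   use reduce-scatter + allgather. Cutoff is illustrative. */
int use_recursive_doubling(size_t msg_bytes, int op_is_commutative) {
    return !op_is_commutative || msg_bytes < 2048;
}
```
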
<p>The latter three methods are only further optimizations and are not integrated into $\verb+MPICH+$.</p>
<h1 id="Allreduce算法的评估和对比"><a href="#Allreduce算法的评估和对比" class="headerlink" title="Evaluation and comparison of Allreduce algorithms"></a>Evaluation and comparison of Allreduce algorithms</h1><p>Cost models of the classic $Allreduce$ algorithms in common use:</p>
<table>
<thead>
<tr>
<th align="left">Allreduce Algorithm</th>
<th align="left">Cost Model</th>
<th align="left">Effective Bandwidth</th>
</tr>
</thead>
<tbody><tr>
<td align="left">Reduce + Broadcast</td>
<td align="left">$2\lceil{\lg{p}}\rceil(\alpha+n\beta+n\gamma)$</td>
<td align="left">$\frac{B}{2}$</td>
</tr>
<tr>
<td align="left">Recursive Doubling</td>
<td align="left">$\lg{p}(\alpha+n\beta+n\gamma)$</td>
<td align="left">…</td>
</tr>
<tr>
<td align="left">Reduce-Scatter + Allgather</td>
<td align="left">$2\lg{p}\cdot\alpha+\frac{p-1}{p}(2n\beta+n\gamma)$</td>
<td align="left">…</td>
</tr>
<tr>
<td align="left">Binary Block</td>
<td align="left">…</td>
<td align="left">…</td>
</tr>
<tr>
<td align="left">Ring Algorithm</td>
<td align="left">$2(p-1)\alpha+\frac{p-1}{p}(2n\beta+n\gamma)$</td>
<td align="left">$\frac{p}{2(p-1)}B$</td>
</tr>
</tbody></table>
<h1 id="OpenMPI和MPICH中的Allreduce算法"><a href="#OpenMPI和MPICH中的Allreduce算法" class="headerlink" title="Allreduce algorithms in OpenMPI and MPICH"></a>Allreduce algorithms in OpenMPI and MPICH</h1><h2 id="1、OpenMPI-4-1-2的MPI-Allreduce实现"><a href="#1、OpenMPI-4-1-2的MPI-Allreduce实现" class="headerlink" title="1. The MPI_Allreduce implementation in OpenMPI-4.1.2"></a>1. The MPI_Allreduce implementation in OpenMPI-4.1.2</h2><p>$\verb+OpenMPI-4.1.2+$ is the latest $\verb+OpenMPI+$ release (at the time of writing). Algorithm selection lives in the files $\verb+ompi/mca/coll/tuned/coll_tuned_decision_fixed.c+$ and $\verb+ompi/mca/coll/tuned/coll_tuned_decision_dynamic.c+$; users can specify rules as well as choose which algorithm to use, and $\verb+OpenMPI+$ implements 6 algorithms, namely:</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">Algorithms:</span><br><span class="line">   &#123;<span class="number">1</span>, <span class="string">&quot;basic_linear&quot;</span>&#125;: Reduce + Broadcast</span><br><span class="line">   &#123;<span class="number">2</span>, <span class="string">&quot;nonoverlapping&quot;</span>&#125;: Reduce +Broadcast</span><br><span class="line">   &#123;<span class="number">3</span>, <span class="string">&quot;recursive_doubling&quot;</span>&#125;: Recursive Doubling</span><br><span class="line">   &#123;<span class="number">4</span>, <span class="string">&quot;ring&quot;</span>&#125;: Ring(Segmented Messages) + Allgather(Ring)</span><br><span class="line">   &#123;<span class="number">5</span>, <span class="string">&quot;segmented_ring&quot;</span>&#125;: Segmented Ring</span><br><span class="line">   &#123;<span class="number">6</span>, <span class="string">&quot;rabenseifner&quot;</span>&#125;: Reduce-Scatter + Allgather</span><br><span class="line">   <span class="comment">/* Currently, ring, segmented ring, and rabenseifner do not support non-commutative operations. */</span></span><br></pre></td></tr></table></figure>
<p>By default, the rules in $\verb+coll_tuned_decision_fixed.c+$ are used (the fixed selection rules); the concrete selection logic is shown below (the original code is a 100-plus-line else-if chain, brute force indeed):<br><img src="https://note.youdao.com/yws/api/personal/file/WEBef75c5797a8fe9a7d3afe7d9f84db3a2?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13"><br><img src="https://note.youdao.com/yws/api/personal/file/WEBe38b3f100dd6599ec413faa1ee25edcf?method=download&shareKey=0412dcb1d5f95516723864a4f1a48a13"><br>Beyond the default rules, users can also specify parameters to select a particular algorithm.<br>The function selection logic:</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">//dynamic algorithm selection rules</span></span><br><span class="line"><span class="type">int</span> <span class="title function_">ompi_coll_tuned_allreduce_intra_dec_dynamic</span><span class="params">()</span></span><br><span class="line">&#123;</span><br><span class="line">    ...</span><br><span class="line">    <span class="comment">//if file-based rules were specified (not sure yet what these are);</span></span><br><span class="line">    <span class="keyword">if</span> (tuned_module-&gt;com_rules[ALLREDUCE])</span><br><span class="line">    
&#123;</span><br><span class="line">        ...</span><br><span class="line">        <span class="type">int</span> algorithm = ompi_coll_tuned_get_target_method_params();</span><br><span class="line">        <span class="keyword">if</span>(algorithm) <span class="keyword">return</span> ompi_coll_tuned_allreduce_intra_do_this(..., algorithm, ...)</span><br><span class="line">        &#123;</span><br><span class="line">            ...</span><br><span class="line">            <span class="keyword">switch</span> (algorithm) &#123;</span><br><span class="line">                <span class="keyword">case</span> (<span class="number">0</span>): <span class="keyword">return</span> ompi_coll_tuned_allreduce_intra_dec_fixed();</span><br><span class="line">                <span class="keyword">case</span> (<span class="number">1</span>): <span class="keyword">return</span> ompi_coll_base_allreduce_intra_basic_linear();</span><br><span class="line">                <span class="keyword">case</span> (<span class="number">2</span>): <span class="keyword">return</span> ompi_coll_base_allreduce_intra_nonoverlapping();</span><br><span class="line">                <span class="keyword">case</span> (<span class="number">3</span>): <span class="keyword">return</span> ompi_coll_base_allreduce_intra_recursivedoubling();</span><br><span class="line">                <span class="keyword">case</span> (<span class="number">4</span>): <span class="keyword">return</span> ompi_coll_base_allreduce_intra_ring();</span><br><span class="line">                <span class="keyword">case</span> (<span class="number">5</span>): <span class="keyword">return</span> ompi_coll_base_allreduce_intra_ring_segmented();</span><br><span class="line">                <span class="keyword">case</span> (<span class="number">6</span>): <span class="keyword">return</span> ompi_coll_base_allreduce_intra_redscat_allgather();</span><br><span class="line">            &#125;</span><br><span class="line">            
...</span><br><span class="line">        &#125;</span><br><span class="line">        ...</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="comment">//if the user forced a specific algorithm;</span></span><br><span class="line">    <span class="keyword">if</span> (tuned_module-&gt;user_forced[ALLREDUCE].algorithm)</span><br><span class="line">    &#123;</span><br><span class="line">        ...</span><br><span class="line">        <span class="keyword">return</span> ompi_coll_tuned_allreduce_intra_do_this(..., algorithm, ...);</span><br><span class="line">    &#125;</span><br><span class="line">    <span class="comment">//otherwise, fall back to the fixed rules</span></span><br><span class="line">    <span class="keyword">else</span> <span class="keyword">return</span> ompi_coll_tuned_allreduce_intra_dec_fixed(...)</span><br><span class="line">    &#123;</span><br><span class="line">        ...</span><br><span class="line">        <span class="type">int</span> ompi_coll_tuned_allreduce_intra_dec_fixed (...)</span><br><span class="line">        &#123;</span><br><span class="line">            <span class="comment">//100+ lines of if-else picking algorithm (a value from 1~6) from the process count and message size</span></span><br><span class="line">            <span class="keyword">return</span> ompi_coll_tuned_allreduce_intra_do_this (..., algorithm, ...);</span><br><span class="line">        &#125;;</span><br><span class="line">    &#125;</span><br><span class="line">    ...</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>

<h2 id="2、MPICH-4-0-2的MPI-Allreduce实现"><a href="#2、MPICH-4-0-2的MPI-Allreduce实现" class="headerlink" title="2、MPICH-4.0.2的MPI_Allreduce实现"></a>2. The MPI_Allreduce Implementation in MPICH 4.0.2</h2><p>When an $MPI$ application calls $\verb+MPI_Allreduce+$, the main algorithms, the functions that implement them, and the selection logic are as follows:</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span 
class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">/*</span></span><br><span class="line"><span class="comment">    Algorithm:</span></span><br><span class="line"><span class="comment">        Intra Communication:</span></span><br><span class="line"><span class="comment">            Recursive Doubling;</span></span><br><span class="line"><span class="comment">            Reduce-scatter + Allgather</span></span><br><span class="line"><span class="comment">            Nb(Nonblocking Allreduce + Wait)</span></span><br><span class="line"><span class="comment">            Smp(Local Reduce + Bcast)</span></span><br><span class="line"><span class="comment">        Inter Communication:</span></span><br><span class="line"><span class="comment">            Reduce-exchange + Bcast</span></span><br><span class="line"><span class="comment">            Nb(Nonblocking Allreduce + Wait)</span></span><br><span class="line"><span class="comment">*/</span></span><br><span class="line"><span class="type">int</span> <span class="title function_">MPI_Allreduce</span><span class="params">(<span class="type">const</span> <span class="type">void</span> *sendbuf, <span class="type">void</span> *recvbuf, <span class="type">int</span> count, MPI_Datatype 
datatype, MPI_Op op, MPI_Comm comm)</span></span><br><span class="line">&#123;</span><br><span class="line">    ...</span><br><span class="line">    MPIR_Allreduce(sendbuf, recvbuf, count, datatype, op, comm_ptr, &amp;errflag)</span><br><span class="line">    &#123;</span><br><span class="line">        ...</span><br><span class="line">        MPIR_Allreduce_impl(sendbuf, recvbuf, count, datatype, op, comm_ptr, &amp;errflag)</span><br><span class="line">        &#123;</span><br><span class="line">            <span class="keyword">if</span>(<span class="comment">/*intra communicator*/</span>) </span><br><span class="line">            &#123;</span><br><span class="line">                <span class="keyword">switch</span>(intra_alrogithm)</span><br><span class="line">                &#123;</span><br><span class="line">                    <span class="keyword">case</span> intra_recursive_doubling:</span><br><span class="line">                        MPIR_Allreduce_intra_recursive_doubling(...);</span><br><span class="line">                        <span class="keyword">break</span>;</span><br><span class="line">                    <span class="keyword">case</span> intra_reduce_scatter_allgather:</span><br><span class="line">                        MPIR_Allreduce_intra_reduce_scatter_allgather(...);</span><br><span class="line">                        <span class="keyword">break</span>;</span><br><span class="line">                    <span class="keyword">case</span> nb:</span><br><span class="line">                        MPIR_Allreduce_allcomm_nb(...);</span><br><span class="line">                        <span class="keyword">break</span>;</span><br><span class="line">                    <span class="keyword">case</span> intra_smp:</span><br><span class="line">                        MPIR_Allreduce_intra_smp(...);</span><br><span class="line">                        <span class="keyword">break</span>;</span><br><span class="line">                    <span 
class="keyword">case</span> <span class="keyword">auto</span>:</span><br><span class="line">                        MPIR_Allreduce_allcomm_auto(...)</span><br><span class="line">                        &#123;</span><br><span class="line">                            MPII_Csel_container_s *cnt = MPIR_Csel_search(comm_ptr-&gt;csel_comm, coll_sig);</span><br><span class="line">                            <span class="keyword">switch</span> (cnt-&gt;id)</span><br><span class="line">                            &#123;</span><br><span class="line">                                <span class="keyword">case</span> intra_recursive_doubling:</span><br><span class="line">                                    MPIR_Allreduce_intra_recursive_doubling(...);</span><br><span class="line">                                    <span class="keyword">break</span>;</span><br><span class="line">                                <span class="keyword">case</span> intra_reduce_scatter_allgather:</span><br><span class="line">                                    MPIR_Allreduce_intra_reduce_scatter_allgather(...);</span><br><span class="line">                                    <span class="keyword">break</span>;</span><br><span class="line">                                <span class="keyword">case</span> intra_smp:</span><br><span class="line">                                    MPIR_Allreduce_intra_smp(...);</span><br><span class="line">                                    <span class="keyword">break</span>;</span><br><span class="line">                                <span class="keyword">case</span> inter_reduce_exchange_bcast:</span><br><span class="line">                                    MPIR_Allreduce_inter_reduce_exchange_bcast(...);</span><br><span class="line">                                    <span class="keyword">break</span>;</span><br><span class="line">                                <span class="keyword">case</span> nb:</span><br><span class="line">                                   
 MPIR_Allreduce_allcomm_nb(...)</span><br><span class="line">                                    &#123;</span><br><span class="line">                                        MPIR_Iallreduce(...);       <span class="comment">/*Nonblocking Allreduce*/</span></span><br><span class="line">                                    &#125;</span><br><span class="line">                                    <span class="keyword">break</span>;</span><br><span class="line">                            &#125;</span><br><span class="line">                        &#125;</span><br><span class="line">                &#125;</span><br><span class="line">            &#125;</span><br><span class="line">            <span class="keyword">else</span>    <span class="comment">/*inter communicator*/</span></span><br><span class="line">            &#123;</span><br><span class="line">                <span class="keyword">switch</span>(inter_algorithm)</span><br><span class="line">                &#123;</span><br><span class="line">                    <span class="keyword">case</span> inter_reduce_exchange_bcast:</span><br><span class="line">                        MPIR_Allreduce_inter_reduce_exchange_bcast(...);</span><br><span class="line">                        <span class="keyword">break</span>;</span><br><span class="line">                    <span class="keyword">case</span> nb:</span><br><span class="line">                        MPIR_Allreduce_allcomm_nb(...);</span><br><span class="line">                        <span class="keyword">break</span>;</span><br><span class="line">                    <span class="keyword">case</span> <span class="keyword">auto</span>:</span><br><span class="line">                        MPIR_Allreduce_allcomm_auto(...);</span><br><span class="line">                        <span class="keyword">break</span>;</span><br><span class="line">                &#125;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span 
class="line">        ...</span><br><span class="line">    &#125;</span><br><span class="line">    ...</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
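<p>MPICH exposes the same switch through CVARs, which can be set as environment variables. A sketch, assuming MPICH&#39;s documented <code>MPIR_CVAR</code> naming (the binary name is a placeholder):</p>

```shell
# Bypass MPIR_Allreduce_allcomm_auto and always use recursive doubling
# on intra-communicators.
export MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=recursive_doubling
mpirun -np 8 ./allreduce_benchmark
```

<p>Leaving the CVAR at its default (<code>auto</code>) keeps the <code>MPIR_Csel_search</code> path shown above.</p>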
      
    </div>
    <footer class="article-footer">
      <a data-url="https://chudod.gitee.io/myworks/2022/03/27/MPI_Allreduce_Summary/" data-id="cl3kznr00000ej0uk3b672zop" data-title="MPI_Allreduce的前世今生" class="article-share-link">Share</a>
      
      
      
  <ul class="article-tag-list" itemprop="keywords"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/MPI/" rel="tag">MPI</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/" rel="tag">并行计算</a></li></ul>

    </footer>
  </div>
  
</article>



  
    <article id="post-Cannon" class="h-entry article article-type-post" itemprop="blogPost" itemscope itemtype="https://schema.org/BlogPosting">
  <div class="article-meta">
    <a href="/myworks/2022/03/26/Cannon/" class="article-date">
  <time class="dt-published" datetime="2022-03-25T16:00:00.000Z" itemprop="datePublished">2022-03-26</time>
</a>
    
  <div class="article-category">
    <a class="article-category-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/">并行计算</a>►<a class="article-category-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/MPI%E7%A8%8B%E5%BA%8F%E8%AE%BE%E8%AE%A1%E4%B8%8E%E5%B9%B6%E8%A1%8C%E7%AE%97%E6%B3%95/">MPI程序设计与并行算法</a>
  </div>

  </div>
  <div class="article-inner">
    
    
      <header class="article-header">
        
  
    <h1 itemprop="name">
      <a class="p-name article-title" href="/myworks/2022/03/26/Cannon/">Cannon Matrix Multiplication</a>
    </h1>
  

      </header>
    
    <div class="e-content article-entry" itemprop="articleBody">
      
        <p>The $Cannon$ algorithm is a classic algorithm for parallel matrix multiplication: the processors are arranged in a two-dimensional grid, and a two-dimensional block partitioning assigns each processor a piece of the matrices on which it computes a local partial result, thereby speeding up the computation. For convenience, the matrices in this article&#39;s example programs are all square matrices of order $n$, and the number of processors is an even power of two (1, 4, 16, 64), which guarantees that every processor receives a local block with the same number of elements.</p>
<h1 id="一、二维矩阵串行乘法"><a href="#一、二维矩阵串行乘法" class="headerlink" title="一、二维矩阵串行乘法"></a>1. Serial Two-Dimensional Matrix Multiplication</h1><p>The serial formula for the product $A \cdot B &#x3D; C$ of two square matrices of order $n$ is:<br>$$ C_{ij} &#x3D; \sum_{k&#x3D;0}^{n-1} A_{ik} \cdot B_{kj} $$<br>which can be written in code as:</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="type">void</span> <span class="title function_">Multiply</span><span class="params">(<span class="type">double</span>* A, <span class="type">double</span>* B, <span class="type">double</span>* C, <span class="type">int</span> n)</span></span><br><span class="line">&#123;</span><br><span class="line">	<span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">0</span>; i &lt; n; i++)</span><br><span class="line">		<span class="keyword">for</span> (<span class="type">int</span> j = <span class="number">0</span>; j &lt; n; j++)</span><br><span class="line">			<span class="keyword">for</span> (<span class="type">int</span> k = <span class="number">0</span>; k &lt; n; k++)</span><br><span class="line">				C[i * n + j] += A[i * n + k] * B[j + k * n];</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>The program flattens each two-dimensional matrix into one dimension, using a one-dimensional array in row-major order to emulate a two-dimensional array.</p>
<h1 id="二、Cannon算法"><a href="#二、Cannon算法" class="headerlink" title="二、Cannon算法"></a>2. The Cannon Algorithm</h1><p>One way to parallelize two-dimensional matrix multiplication is two-dimensional block partitioning: arrange $p$ processes ($p$ a perfect square) in a $\sqrt{p}\times\sqrt{p}$ grid, split each of the matrices $A$, $B$, and $C$ into $\sqrt{p}\times\sqrt{p}$ blocks distributed evenly over the grid, and denote the $(i,j)$-th block of each matrix by $A_{ij}$, $B_{ij}$, and $C_{ij}$, stored on the process $P_{ij}$ at grid coordinate $(i,j)$. To compute $C_{ij}$, the blocks $A_{ik}$ (all blocks of $A$ in process row $i$) and $B_{kj}$ (all blocks of $B$ in process column $j$) must first be gathered onto $P_{ij}$; the formula is:<br>$$C_{ij} &#x3D; \sum_{k&#x3D;0}^{\sqrt{p}-1} A_{ik} \cdot B_{kj}$$<br>as shown in the figure below:<br><img src="https://img-blog.csdnimg.cn/d00915586a8d405c973cfd0903f821b7.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5qWa5Zu95Luk5bC5,size_30,color_FFFFFF,t_70,g_se,x_16#pic_center" alt="2-D block partitioning"></p>
<p>However, having every process gather its own copies of $A_{ik}$ and $B_{kj}$ wastes a great deal of memory and is also slow. The more efficient Cannon algorithm avoids this duplication; it computes:<br>$$C_{ij} &#x3D; \sum_{k&#x3D;0}^{\sqrt{p}-1} A_{i,\,(i+j+k) \bmod \sqrt{p}} \cdot B_{(i+j+k) \bmod \sqrt{p},\,j}$$</p>
<p>The basic idea of the Cannon algorithm is that each process stores only one block of each of $A$, $B$, and $C$: it multiplies its local $A$ and $B$ blocks, accumulates the product into its local $C$ block, then exchanges $A$ and $B$ blocks with other processes and repeats. The matching $A$ and $B$ blocks must line up, however. Initially, the $A$ block on $P_{ij}$ is the $j$-th in its row and the $B$ block is the $i$-th in its column, so they do not correspond. Therefore the $A$ blocks in each row $i$ are cyclically shifted left by $i$ positions and the $B$ blocks in each column $j$ are cyclically shifted up by $j$ positions, after which they do correspond. From then on, after each multiplication every row of $A$ blocks shifts cyclically one position left and every column of $B$ blocks shifts cyclically one position up, and each product of $A$ and $B$ blocks is accumulated into the local $C$ block.<br><img src="https://img-blog.csdnimg.cn/4e1f8ac94d8e4de89df2fbd4707042fa.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5qWa5Zu95Luk5bC5,size_30,color_FFFFFF,t_70,g_se,x_16#pic_center" alt="Block layout"></p>
<p>The steps of the algorithm are as follows:</p>
<h2 id="1、第一次重排列"><a href="#1、第一次重排列" class="headerlink" title="1、第一次重排列"></a>1. The Initial Alignment</h2><p>At $k&#x3D;0$, the block $A_{i,(i+j) \bmod \sqrt{p}}$ does not reside on $P_{ij}$ and must be aligned, so it is sent to $P_{ij}$. Concretely, every row of the grid cyclically shifts its $A$ blocks to the left: the processes of row $i$ shift their $A$ blocks left by $i$ positions. Likewise $B_{(i+j) \bmod \sqrt{p},j}$ does not reside on $P_{ij}$ and is sent there: the processes of column $j$ shift their $B$ blocks up by $j$ positions, as shown below:<br><img src="https://img-blog.csdnimg.cn/8b687e17cec04361b3581c9ba7c50838.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5qWa5Zu95Luk5bC5,size_30,color_FFFFFF,t_70,g_se,x_16#pic_center" alt="Initial layout"><br>After this first realignment the blocks are arranged as:<br><img src="https://img-blog.csdnimg.cn/af16cd39d9ef459b96bdef69b410355b.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5qWa5Zu95Luk5bC5,size_30,color_FFFFFF,t_70,g_se,x_16#pic_center" alt="After the initial alignment"><br>Once each process has its realigned $A$ and $B$ blocks, it multiplies them and accumulates the product into its local $C$ block.</p>
<h2 id="2、后续重排列"><a href="#2、后续重排列" class="headerlink" title="2、后续重排列"></a>2. Subsequent Shifts</h2><p>After each subsequent multiplication, every row of processes cyclically shifts its $A$ blocks one position to the left and every column cyclically shifts its $B$ blocks one position up, as shown below; each product of $A$ and $B$ blocks is accumulated into the local $C$ block. This step is repeated $\sqrt{p}-1$ times.<br><img src="https://img-blog.csdnimg.cn/643bf8d4f65c444288bc7e7eae8314fd.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5qWa5Zu95Luk5bC5,size_35,color_FFFFFF,t_70,g_se,x_16#pic_center" alt="Shifting"><br>At the end, process $P_{ij}$ holds its local result $C_{ij}$.</p>
<h1 id="三、程序实现"><a href="#三、程序实现" class="headerlink" title="三、程序实现"></a>3. Implementation</h1><h2 id="1、创建二维进程拓扑结构"><a href="#1、创建二维进程拓扑结构" class="headerlink" title="1、创建二维进程拓扑结构"></a>1. Creating the 2-D Process Topology</h2><p>The calls that create the two-dimensional process topology are:</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">dims[<span class="number">0</span>] = dims[<span class="number">1</span>] = <span class="built_in">sqrt</span>(comm_size);</span><br><span class="line">periods[<span class="number">0</span>] = periods[<span class="number">1</span>] = <span class="number">1</span>;</span><br><span class="line">MPI_Cart_create(MPI_COMM_WORLD, <span class="number">2</span>, dims, periods, <span class="number">1</span>, &amp;comm_2d);</span><br></pre></td></tr></table></figure>
<p>$\verb+comm_size+$ is the total number of processes; the array $\verb+dims[2]+$ holds the size of each dimension of the 2-D topology; both entries of $\verb+periods[2]+$ are set to 1, meaning each dimension of the topology wraps around. This gives us the new communicator $\verb+comm_2d+$. Every process is assigned both a rank and grid coordinates; the function that obtains a process&#39;s coordinates from its rank is:</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">MPI_Cart_coords(comm_2d, myrank, <span class="number">2</span>, mycoords);</span><br></pre></td></tr></table></figure>
<p>$\verb+myrank+$ is the process rank, and $\verb+mycoords+$ is a one-dimensional array of size 2 that receives the coordinates.</p>
<h2 id="2、输入输出矩阵"><a href="#2、输入输出矩阵" class="headerlink" title="2、输入输出矩阵"></a>2. Input and Output Matrices</h2><p>The input and output matrices are all $8\times8$. $A$ and $B$ are both orthogonal matrices with $B&#x3D;A^{T}$; the matrix $A$ is:<br><img src="https://img-blog.csdnimg.cn/26b328173bcf48a1bb1d0c38f6439597.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5qWa5Zu95Luk5bC5,size_30,color_FFFFFF,t_70,g_se,x_16#pic_center" alt="Matrix A"><br>The computed result should therefore be the identity matrix.</p>
<h2 id="3、主程序"><a href="#3、主程序" class="headerlink" title="3、主程序"></a>3. The Main Program</h2><p>The local blocks kept by each process are $\verb+local_A+$, $\verb+local_B+$, and $\verb+local_C+$. For convenience, the number of processes is taken to be one of the four values 1, 4, 16, or 64.</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="keyword">include</span><span 
class="string">&lt;stdio.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="keyword">include</span><span class="string">&lt;stdlib.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="keyword">include</span><span class="string">&lt;iostream&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="keyword">include</span><span class="string">&lt;math.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="keyword">include</span><span class="string">&lt;windows.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="keyword">include</span><span class="string">&lt;mpi.h&gt;</span></span></span><br><span class="line"></span><br><span class="line"><span class="meta">#<span class="keyword">define</span> pi acos(-1)</span></span><br><span class="line"></span><br><span class="line"><span class="type">int</span> myrank, comm_size, srcrank, dstrank;</span><br><span class="line"><span class="type">int</span> dims[<span class="number">2</span>], periods[<span class="number">2</span>], mycoords[<span class="number">2</span>], coords[<span class="number">2</span>];</span><br><span class="line"><span class="type">int</span> n = <span class="number">8</span>, local_n;</span><br><span class="line">MPI_Comm comm_2d;</span><br><span class="line"></span><br><span class="line"><span class="type">void</span> <span class="title function_">Multiply</span><span class="params">(<span class="type">double</span>* A, <span class="type">double</span>* B, <span class="type">double</span>* C, <span class="type">int</span> n)</span>;</span><br><span class="line"><span class="type">void</span> <span class="title function_">GenerateData</span><span class="params">(MPI_Comm comm_2d, <span class="type">double</span>* local_A, <span class="type">double</span>* local_B)</span>;</span><br><span class="line"><span class="type">void</span> <span class="title 
function_">Cannon</span><span class="params">(MPI_Comm comm_2d, <span class="type">double</span>* local_A, <span class="type">double</span>* local_B, <span class="type">double</span>* local_C)</span>;</span><br><span class="line"><span class="type">void</span> <span class="title function_">GatherResult</span><span class="params">(MPI_Comm comm_2d, <span class="type">double</span>* local_C)</span>;</span><br><span class="line"></span><br><span class="line"><span class="type">int</span> <span class="title function_">main</span><span class="params">()</span></span><br><span class="line">&#123;</span><br><span class="line">	<span class="type">double</span>* local_A, * local_B, * local_C;</span><br><span class="line"></span><br><span class="line">	MPI_Init(<span class="literal">NULL</span>, <span class="literal">NULL</span>);</span><br><span class="line">	MPI_Comm_size(MPI_COMM_WORLD, &amp;comm_size);</span><br><span class="line">	MPI_Comm_rank(MPI_COMM_WORLD, &amp;myrank);</span><br><span class="line"></span><br><span class="line">	<span class="keyword">if</span> (myrank == <span class="number">0</span>) <span class="built_in">printf</span>(<span class="string">&quot;Number of Process: %d\n&quot;</span>, comm_size);</span><br><span class="line"></span><br><span class="line">	<span class="type">double</span> start = MPI_Wtime();</span><br><span class="line">	</span><br><span class="line">	dims[<span class="number">0</span>] = dims[<span class="number">1</span>] = <span class="built_in">sqrt</span>(comm_size);</span><br><span class="line">	periods[<span class="number">0</span>] = periods[<span class="number">1</span>] = <span class="number">1</span>;</span><br><span class="line">	MPI_Cart_create(MPI_COMM_WORLD, <span class="number">2</span>, dims, periods, <span class="number">1</span>, &amp;comm_2d);</span><br><span class="line"></span><br><span class="line">	MPI_Comm_rank(comm_2d, &amp;myrank);</span><br><span class="line">	MPI_Cart_coords(comm_2d, myrank, <span 
class="number">2</span>, mycoords);</span><br><span class="line"></span><br><span class="line">	local_n = n / dims[<span class="number">0</span>];</span><br><span class="line">	local_A = (<span class="type">double</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="type">double</span>) * local_n * local_n);</span><br><span class="line">	local_B = (<span class="type">double</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="type">double</span>) * local_n * local_n);</span><br><span class="line">	local_C = (<span class="type">double</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="type">double</span>) * local_n * local_n);</span><br><span class="line">	<span class="built_in">memset</span>(local_A, <span class="number">0</span>, <span class="keyword">sizeof</span>(<span class="type">double</span>) * local_n * local_n);</span><br><span class="line">	<span class="built_in">memset</span>(local_B, <span class="number">0</span>, <span class="keyword">sizeof</span>(<span class="type">double</span>) * local_n * local_n);</span><br><span class="line">	<span class="built_in">memset</span>(local_C, <span class="number">0</span>, <span class="keyword">sizeof</span>(<span class="type">double</span>) * local_n * local_n);</span><br><span class="line">	</span><br><span class="line">	GenerateData(comm_2d, local_A, local_B);</span><br><span class="line">	Cannon(comm_2d, local_A, local_B, local_C);</span><br><span class="line">	GatherResult(comm_2d, local_C);</span><br><span class="line"></span><br><span class="line">	<span class="type">double</span> end = MPI_Wtime();</span><br><span class="line">	<span class="keyword">if</span>(myrank == <span class="number">0</span>) <span class="built_in">printf</span>(<span class="string">&quot;Running time: %.2f seconds...\n&quot;</span>, end - start);</span><br><span class="line"></span><br><span class="line">	
MPI_Comm_free(&amp;comm_2d);</span><br><span class="line">	MPI_Finalize();</span><br><span class="line">	<span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<h2 id="4、Cannon算法主函数"><a href="#4、Cannon算法主函数" class="headerlink" title="4、Cannon算法主函数"></a>4. The Main Cannon Function</h2><figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span
class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br></pre></td><td class="code"><pre><span class="line"><span class="type">void</span> <span class="title function_">Cannon</span><span class="params">(MPI_Comm comm_2d, <span class="type">double</span>* local_A, <span class="type">double</span>* local_B, <span class="type">double</span>* local_C)</span></span><br><span class="line">&#123;</span><br><span class="line">	<span class="type">int</span> uprank, downrank, leftrank, rightrank;</span><br><span class="line">	MPI_Status status;</span><br><span class="line"></span><br><span class="line">	<span class="comment">//Compute the rank of the left neighbor (destination) leftrank and the right neighbor (source) rightrank</span></span><br><span class="line">	coords[<span class="number">0</span>] = mycoords[<span class="number">0</span>];</span><br><span class="line">	coords[<span class="number">1</span>] = (mycoords[<span class="number">1</span>] - <span class="number">1</span>) % dims[<span class="number">1</span>];</span><br><span class="line">	MPI_Cart_rank(comm_2d, coords, &amp;leftrank);</span><br><span class="line">	coords[<span class="number">1</span>] = (mycoords[<span class="number">1</span>] + <span class="number">1</span>) % dims[<span class="number">1</span>];</span><br><span class="line">	MPI_Cart_rank(comm_2d, coords, &amp;rightrank);</span><br><span class="line">	</span><br><span class="line">	<span class="comment">//Compute the rank of the upper neighbor (destination) uprank and the lower neighbor (source) downrank</span></span><br><span class="line">	coords[<span class="number">0</span>] = (mycoords[<span class="number">0</span>] - <span class="number">1</span>) % dims[<span class="number">0</span>];</span><br><span class="line">	coords[<span class="number">1</span>] = mycoords[<span class="number">1</span>];</span><br><span class="line">	MPI_Cart_rank(comm_2d, coords, 
&amp;uprank);</span><br><span class="line">	coords[<span class="number">0</span>] = (mycoords[<span class="number">0</span>] + <span class="number">1</span>) % dims[<span class="number">0</span>];</span><br><span class="line">	MPI_Cart_rank(comm_2d, coords, &amp;downrank);</span><br><span class="line"></span><br><span class="line">	<span class="comment">//Initial alignment of the A blocks</span></span><br><span class="line">	coords[<span class="number">0</span>] = mycoords[<span class="number">0</span>];</span><br><span class="line">	coords[<span class="number">1</span>] = (mycoords[<span class="number">1</span>] - mycoords[<span class="number">0</span>]) % dims[<span class="number">1</span>];</span><br><span class="line">	MPI_Cart_rank(comm_2d, coords, &amp;dstrank);</span><br><span class="line">	coords[<span class="number">1</span>] = (mycoords[<span class="number">1</span>] + mycoords[<span class="number">0</span>]) % dims[<span class="number">1</span>];</span><br><span class="line">	MPI_Cart_rank(comm_2d, coords, &amp;srcrank);</span><br><span class="line">	<span class="keyword">if</span> (myrank != dstrank)</span><br><span class="line">	&#123;</span><br><span class="line">		MPI_Sendrecv_replace(local_A, local_n * local_n, MPI_DOUBLE, dstrank, <span class="number">0</span>, srcrank, <span class="number">0</span>, comm_2d, &amp;status);</span><br><span class="line">	&#125;</span><br><span class="line"></span><br><span class="line">	<span class="comment">//Initial alignment of the B blocks</span></span><br><span class="line">	coords[<span class="number">1</span>] = mycoords[<span class="number">1</span>];</span><br><span class="line">	coords[<span class="number">0</span>] = (mycoords[<span class="number">0</span>] - mycoords[<span class="number">1</span>]) % dims[<span class="number">0</span>];</span><br><span class="line">	MPI_Cart_rank(comm_2d, coords, &amp;dstrank);</span><br><span class="line">	coords[<span class="number">0</span>] = (mycoords[<span class="number">0</span>] + mycoords[<span
class="number">1</span>]) % dims[<span class="number">0</span>];</span><br><span class="line">	MPI_Cart_rank(comm_2d, coords, &amp;srcrank);</span><br><span class="line">	<span class="keyword">if</span> (myrank != dstrank)</span><br><span class="line">	&#123;</span><br><span class="line">		MPI_Sendrecv_replace(local_B, local_n * local_n, MPI_DOUBLE, dstrank, <span class="number">0</span>, srcrank, <span class="number">0</span>, comm_2d, &amp;status);</span><br><span class="line">	&#125;</span><br><span class="line"></span><br><span class="line">	Multiply(local_A, local_B, local_C, local_n);</span><br><span class="line"></span><br><span class="line">	<span class="keyword">for</span> (<span class="type">int</span> time = <span class="number">0</span>; time &lt; dims[<span class="number">0</span>] - <span class="number">1</span>; time++)</span><br><span class="line">	&#123;</span><br><span class="line">		<span class="comment">//local_A循环往左滚动一格</span></span><br><span class="line">		MPI_Sendrecv_replace(local_A, local_n * local_n, MPI_DOUBLE, leftrank, <span class="number">0</span>, rightrank, <span class="number">0</span>, comm_2d, &amp;status);</span><br><span class="line">		</span><br><span class="line">		<span class="comment">//local_B循环往上滚动一个</span></span><br><span class="line">		MPI_Sendrecv_replace(local_B, local_n * local_n, MPI_DOUBLE, uprank, <span class="number">0</span>, downrank, <span class="number">0</span>, comm_2d, &amp;status);</span><br><span class="line"></span><br><span class="line">		Multiply(local_A, local_B, local_C, local_n);</span><br><span class="line">	&#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line"><span class="type">void</span> <span class="title function_">Multiply</span><span class="params">(<span class="type">double</span>* A, <span class="type">double</span>* B, <span class="type">double</span>* C, <span class="type">int</span> n)</span></span><br><span 
class="line">&#123;</span><br><span class="line">	<span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">0</span>; i &lt; n; i++)</span><br><span class="line">		<span class="keyword">for</span> (<span class="type">int</span> j = <span class="number">0</span>; j &lt; n; j++)</span><br><span class="line">			<span class="keyword">for</span> (<span class="type">int</span> k = <span class="number">0</span>; k &lt; n; k++)</span><br><span class="line">			&#123;</span><br><span class="line">				C[i * n + j] += A[i * n + k] * B[j + k * n];</span><br><span class="line">				Sleep(<span class="number">10</span>);</span><br><span class="line">			&#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<p>Here MPI_Cart_rank converts a pair of process coordinates into a process rank. MPI_Sendrecv_replace can be viewed as a combination of MPI_Send and MPI_Recv: on a wrap-around ring, every process sends data to its destination process dstrank and receives data from its source process srcrank, reusing a single buffer for both the send and the receive. This call is what implements the circular shift of the $A$ and $B$ sub-blocks.</p>
<h2 id="5、生成数据并分发到各进程的函数"><a href="#5、生成数据并分发到各进程的函数" class="headerlink" title="5、生成数据并分发到各进程的函数"></a>5. Generating the Data and Distributing It to Each Process</h2><p>Process 0 packs the data of matrices $A$ and $B$ into bufferA and bufferB, then sends each buffer to the corresponding process, where it is received into local_A and local_B.</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br></pre></td><td 
class="code"><pre><span class="line"><span class="type">void</span> <span class="title function_">GenerateData</span><span class="params">(MPI_Comm comm_2d, <span class="type">double</span>* local_A, <span class="type">double</span>* local_B)</span></span><br><span class="line">&#123;</span><br><span class="line">	<span class="keyword">if</span> (mycoords[<span class="number">0</span>] == <span class="number">0</span> &amp;&amp; mycoords[<span class="number">1</span>] == <span class="number">0</span>)			<span class="comment">//0号进程生成和发送数据</span></span><br><span class="line">	&#123;</span><br><span class="line">		<span class="type">double</span> A[<span class="number">8</span> * <span class="number">8</span>] = &#123;</span><br><span class="line">		<span class="built_in">cos</span>(pi / <span class="number">6</span>), -<span class="built_in">sin</span>(pi / <span class="number">6</span>), <span class="number">0</span>, <span class="number">0</span> , <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>,</span><br><span class="line">		<span class="built_in">sin</span>(pi / <span class="number">6</span>), <span class="built_in">cos</span>(pi / <span class="number">6</span>), <span class="number">0</span>, <span class="number">0</span> , <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>,</span><br><span class="line">		<span class="number">0</span>, <span class="number">0</span>, <span class="built_in">cos</span>(pi / <span class="number">4</span>), -<span class="built_in">sin</span>(pi / <span class="number">4</span>), <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>,</span><br><span class="line">		<span class="number">0</span>, <span class="number">0</span>, <span class="built_in">sin</span>(pi / <span class="number">4</span>), <span 
class="built_in">cos</span>(pi / <span class="number">4</span>), <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>,</span><br><span class="line">		<span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="built_in">cos</span>(pi / <span class="number">6</span>), -<span class="built_in">sin</span>(pi / <span class="number">6</span>), <span class="number">0</span>, <span class="number">0</span>,</span><br><span class="line">		<span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="built_in">sin</span>(pi / <span class="number">6</span>), <span class="built_in">cos</span>(pi / <span class="number">6</span>), <span class="number">0</span>, <span class="number">0</span>,</span><br><span class="line">		<span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="built_in">cos</span>(pi / <span class="number">4</span>),  -<span class="built_in">sin</span>(pi / <span class="number">4</span>),</span><br><span class="line">		<span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="number">0</span>, <span class="built_in">sin</span>(pi / <span class="number">4</span>), <span class="built_in">cos</span>(pi / <span class="number">4</span>),</span><br><span class="line">		&#125;;</span><br><span class="line"></span><br><span class="line">		<span class="type">double</span>* B = (<span class="type">double</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="type">double</span>) * n * n);</span><br><span class="line">		<span class="built_in">memset</span>(B, <span 
class="number">0</span>, <span class="keyword">sizeof</span>(<span class="type">double</span>) * n * n);</span><br><span class="line">		<span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">0</span>; i &lt; n; i++)</span><br><span class="line">			<span class="keyword">for</span> (<span class="type">int</span> j = <span class="number">0</span>; j &lt; n; j++)</span><br><span class="line">				B[j * n + i] = A[i * n + j];</span><br><span class="line"></span><br><span class="line">		<span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">0</span>; i &lt; local_n; i++)</span><br><span class="line">			<span class="keyword">for</span> (<span class="type">int</span> j = <span class="number">0</span>; j &lt; local_n; j++)</span><br><span class="line">			&#123;</span><br><span class="line">				local_A[i * local_n + j] = A[i * n + j];</span><br><span class="line">				local_B[i * local_n + j] = B[i * n + j];</span><br><span class="line">			&#125;</span><br><span class="line"></span><br><span class="line">		<span class="keyword">for</span> (<span class="type">int</span> row = <span class="number">0</span>; row &lt; dims[<span class="number">0</span>]; row++)</span><br><span class="line">			<span class="keyword">for</span> (<span class="type">int</span> col = <span class="number">0</span>; col &lt; dims[<span class="number">1</span>]; col++)</span><br><span class="line">			&#123;</span><br><span class="line">				<span class="keyword">if</span> (row == <span class="number">0</span> &amp;&amp; col == <span class="number">0</span>) <span class="keyword">continue</span>;		<span class="comment">//0号进程</span></span><br><span class="line">				<span class="comment">//offset是坐标为(row,col)进程对应的A、B子块的第一个元素在A、B矩阵中的下标</span></span><br><span class="line">				<span class="type">int</span> offset = row * dims[<span class="number">1</span>] * local_n * local_n + col * local_n;</span><br><span class="line">				<span 
class="type">double</span>* bufferA = (<span class="type">double</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="type">double</span>) * local_n * local_n);</span><br><span class="line">				<span class="type">double</span>* bufferB = (<span class="type">double</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="type">double</span>) * local_n * local_n);</span><br><span class="line"></span><br><span class="line">				<span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">0</span>; i &lt; local_n; i++)</span><br><span class="line">					<span class="keyword">for</span> (<span class="type">int</span> j = <span class="number">0</span>; j &lt; local_n; j++)</span><br><span class="line">					&#123;</span><br><span class="line">						bufferA[i * local_n + j] = A[offset + i * n + j];</span><br><span class="line">						bufferB[i * local_n + j] = B[offset + i * n + j];</span><br><span class="line">					&#125;</span><br><span class="line"></span><br><span class="line">				coords[<span class="number">0</span>] = row;</span><br><span class="line">				coords[<span class="number">1</span>] = col;</span><br><span class="line">				MPI_Cart_rank(comm_2d, coords, &amp;dstrank);</span><br><span class="line">				MPI_Send(bufferA, local_n * local_n, MPI_DOUBLE, dstrank, <span class="number">0</span>, comm_2d);</span><br><span class="line">				MPI_Send(bufferB, local_n * local_n, MPI_DOUBLE, dstrank, <span class="number">1</span>, comm_2d);</span><br><span class="line">				<span class="built_in">free</span>(bufferA);</span><br><span class="line">				<span class="built_in">free</span>(bufferB);</span><br><span class="line">			&#125;</span><br><span class="line">		<span class="built_in">free</span>(B);</span><br><span class="line">	&#125;</span><br><span class="line">	<span class="keyword">else</span></span><br><span class="line">	&#123;</span><br><span class="line">		
MPI_Recv(local_A, local_n * local_n, MPI_DOUBLE, <span class="number">0</span>, <span class="number">0</span>, comm_2d, MPI_STATUS_IGNORE);</span><br><span class="line">		MPI_Recv(local_B, local_n * local_n, MPI_DOUBLE, <span class="number">0</span>, <span class="number">1</span>, comm_2d, MPI_STATUS_IGNORE);</span><br><span class="line">	&#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<h2 id="6、收集结果到0号进程的函数"><a href="#6、收集结果到0号进程的函数" class="headerlink" title="6、收集结果到0号进程的函数"></a>6. Gathering the Results onto Process 0</h2><p>Every process other than process 0 sends its result (its local $C$ sub-block) to process 0; process 0 receives each block into bufferC and copies the data into the corresponding position of matrix $C$, copying its own sub-block directly.</p>
<figure class="highlight c"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br></pre></td><td class="code"><pre><span class="line"><span class="type">void</span> <span class="title function_">GatherResult</span><span class="params">(MPI_Comm comm_2d, <span class="type">double</span>* local_C)</span></span><br><span class="line">&#123;</span><br><span class="line">	<span class="type">int</span> otherrank;</span><br><span class="line">	MPI_Status status;</span><br><span class="line"></span><br><span class="line">	
<span class="keyword">if</span> (coords[<span class="number">0</span>] == <span class="number">0</span> &amp;&amp; coords[<span class="number">1</span>] == <span class="number">0</span>)</span><br><span class="line">	&#123;</span><br><span class="line">		<span class="type">double</span>* C = (<span class="type">double</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="type">double</span>) * n * n);</span><br><span class="line">		<span class="built_in">memset</span>(C, <span class="number">0</span>, <span class="keyword">sizeof</span>(<span class="type">double</span>) * n * n);</span><br><span class="line">		<span class="type">double</span>* bufferC = (<span class="type">double</span>*)<span class="built_in">malloc</span>(<span class="keyword">sizeof</span>(<span class="type">double</span>) * local_n * local_n);</span><br><span class="line">		<span class="built_in">memset</span>(bufferC, <span class="number">0</span>, <span class="keyword">sizeof</span>(<span class="type">double</span>) * local_n * local_n);</span><br><span class="line">		<span class="keyword">for</span> (<span class="type">int</span> row = <span class="number">0</span>; row &lt; dims[<span class="number">0</span>]; row++)</span><br><span class="line">		&#123;</span><br><span class="line">			<span class="keyword">for</span> (<span class="type">int</span> col = <span class="number">0</span>; col &lt; dims[<span class="number">1</span>]; col++)</span><br><span class="line">			&#123;</span><br><span class="line">				<span class="keyword">if</span> (row == <span class="number">0</span> &amp;&amp; col == <span class="number">0</span>)</span><br><span class="line">				&#123;</span><br><span class="line">					<span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">0</span>; i &lt; local_n; i++)</span><br><span class="line">						<span class="keyword">for</span> (<span class="type">int</span> j = <span class="number">0</span>; j 
&lt; local_n; j++)</span><br><span class="line">							C[i * n + j] = local_C[i * local_n + j];</span><br><span class="line">					<span class="keyword">continue</span>;</span><br><span class="line">				&#125;</span><br><span class="line">				coords[<span class="number">0</span>] = row;</span><br><span class="line">				coords[<span class="number">1</span>] = col;</span><br><span class="line">				MPI_Cart_rank(comm_2d, coords, &amp;otherrank);</span><br><span class="line">				MPI_Recv(bufferC, local_n * local_n, MPI_DOUBLE, otherrank, <span class="number">0</span>, comm_2d, &amp;status);</span><br><span class="line">				<span class="type">int</span> offset = row * dims[<span class="number">1</span>] * local_n * local_n + col * local_n;</span><br><span class="line"></span><br><span class="line">				<span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">0</span>; i &lt; local_n; i++)</span><br><span class="line">					<span class="keyword">for</span> (<span class="type">int</span> j = <span class="number">0</span>; j &lt; local_n; j++)</span><br><span class="line">						C[offset + i * n + j] = bufferC[i * local_n + j];</span><br><span class="line">					</span><br><span class="line">			&#125;</span><br><span class="line">		&#125;</span><br><span class="line">		<span class="built_in">free</span>(bufferC);</span><br><span class="line">		<span class="keyword">for</span> (<span class="type">int</span> i = <span class="number">0</span>; i &lt; n; i++)</span><br><span class="line">		&#123;</span><br><span class="line">			<span class="keyword">for</span> (<span class="type">int</span> j = <span class="number">0</span>; j &lt; n; j++)</span><br><span class="line">				<span class="built_in">printf</span>(<span class="string">&quot;%.2f &quot;</span>, C[i * n + j]);</span><br><span class="line">			<span class="built_in">printf</span>(<span class="string">&quot;\n&quot;</span>);</span><br><span class="line">		&#125;</span><br><span class="line">	
&#125;</span><br><span class="line">	<span class="keyword">else</span></span><br><span class="line">	&#123;</span><br><span class="line">		MPI_Send(local_C, local_n * local_n, MPI_DOUBLE, <span class="number">0</span>, <span class="number">0</span>, comm_2d);</span><br><span class="line">	&#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure>
<h1 id="四、实现结果"><a href="#四、实现结果" class="headerlink" title="四、实现结果"></a>IV. Results</h1><p>The program was run under VS2019. As the number of processes increases, the running time of process 0 drops noticeably (the time shown on screen is process 0's execution time). However, increasing the process count to $n$ times the original does not reduce process 0's running time to $\frac{1}{n}$ of the original. One reason is that process 0 must exchange the $A$ and $B$ data point-to-point with more processes; another, more important, reason is that my machine has an 8-core Intel CPU, so at most 8 processes can actually run at once: with 64 processes, the remaining $64-8=56$ sit in the wait and ready queues until the CPU schedules them, which lengthens the total running time. Even so, using multiple processes clearly accelerates the matrix multiplication.<br><img src="https://img-blog.csdnimg.cn/bf1ed6fdf2b64112bb938e070825e933.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBA5qWa5Zu95Luk5bC5,size_20,color_FFFFFF,t_70,g_se,x_16#pic_center" alt="Execution results"></p>

      
    </div>
    <footer class="article-footer">
      <a data-url="https://chudod.gitee.io/myworks/2022/03/26/Cannon/" data-id="cl3kznqzl0000j0uk2da00qjp" data-title="Cannon矩阵乘法" class="article-share-link">Share</a>
      
      
      
  <ul class="article-tag-list" itemprop="keywords"><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/MPI/" rel="tag">MPI</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/%E5%B9%B6%E8%A1%8C%E7%AE%97%E6%B3%95/" rel="tag">并行算法</a></li><li class="article-tag-list-item"><a class="article-tag-list-link" href="/myworks/tags/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/" rel="tag">并行计算</a></li></ul>

    </footer>
  </div>
  
</article>



  


  <nav id="page-nav">
    
    <a class="extend prev" rel="prev" href="/myworks/">&laquo; Prev</a><a class="page-number" href="/myworks/">1</a><span class="page-number current">2</span><a class="page-number" href="/myworks/page/3/">3</a><a class="extend next" rel="next" href="/myworks/page/3/">Next &raquo;</a>
  </nav>

</section>
        
          <aside id="sidebar">
  
    
  <div class="widget-wrap">
    <h3 class="widget-title">Categories</h3>
    <div class="widget">
      <ul class="category-list"><li class="category-list-item"><a class="category-list-link" href="/myworks/categories/English/">English</a></li><li class="category-list-item"><a class="category-list-link" href="/myworks/categories/%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D1%8F/">Россия</a></li><li class="category-list-item"><a class="category-list-link" href="/myworks/categories/%E5%AF%BC%E8%88%AA/">导航</a></li><li class="category-list-item"><a class="category-list-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/">并行计算</a><ul class="category-list-child"><li class="category-list-item"><a class="category-list-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/In-Network-Collectives/">In-Network Collectives</a></li><li class="category-list-item"><a class="category-list-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/MPI%E7%A8%8B%E5%BA%8F%E8%AE%BE%E8%AE%A1/">MPI程序设计</a></li><li class="category-list-item"><a class="category-list-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/MPI%E7%A8%8B%E5%BA%8F%E8%AE%BE%E8%AE%A1%E4%B8%8E%E5%B9%B6%E8%A1%8C%E7%AE%97%E6%B3%95/">MPI程序设计与并行算法</a></li><li class="category-list-item"><a class="category-list-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/MPI%E8%BD%AF%E4%BB%B6%E6%A0%88%E8%AE%BE%E8%AE%A1/">MPI软件栈设计</a></li><li class="category-list-item"><a class="category-list-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/MPI%E9%9B%86%E5%90%88%E9%80%9A%E4%BF%A1%E7%AE%97%E6%B3%95/">MPI集合通信算法</a></li><li class="category-list-item"><a class="category-list-link" href="/myworks/categories/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/Topology-Aware-Algorithms/">Topology-Aware Algorithms</a></li></ul></li><li class="category-list-item"><a class="category-list-link" href="/myworks/categories/%E6%95%B4%E6%B4%BB%E7%B3%BB%E5%88%97/">整活系列</a></li><li class="category-list-item"><a class="category-list-link" 
href="/myworks/categories/%E7%A1%AC%E6%A0%B8%E7%8B%A0%E4%BA%BA/">硬核狠人</a></li><li class="category-list-item"><a class="category-list-link" href="/myworks/categories/%E8%AE%A1%E7%AE%97%E6%9C%BA%E5%9F%BA%E7%A1%80%E7%9F%A5%E8%AF%86/">计算机基础知识</a><ul class="category-list-child"><li class="category-list-item"><a class="category-list-link" href="/myworks/categories/%E8%AE%A1%E7%AE%97%E6%9C%BA%E5%9F%BA%E7%A1%80%E7%9F%A5%E8%AF%86/%E6%93%8D%E4%BD%9C%E7%B3%BB%E7%BB%9F/">操作系统</a></li></ul></li></ul>
    </div>
  </div>


  
    
  <div class="widget-wrap">
    <h3 class="widget-title">Tags</h3>
    <div class="widget">
      <ul class="tag-list" itemprop="keywords"><li class="tag-list-item"><a class="tag-list-link" href="/myworks/tags/English-Vocabulary/" rel="tag">English Vocabulary</a></li><li class="tag-list-item"><a class="tag-list-link" href="/myworks/tags/English-Writing/" rel="tag">English Writing</a></li><li class="tag-list-item"><a class="tag-list-link" href="/myworks/tags/MPI/" rel="tag">MPI</a></li><li class="tag-list-item"><a class="tag-list-link" href="/myworks/tags/Topology-Aware/" rel="tag">Topology-Aware</a></li><li class="tag-list-item"><a class="tag-list-link" href="/myworks/tags/%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D1%8F/" rel="tag">Россия</a></li><li class="tag-list-item"><a class="tag-list-link" href="/myworks/tags/%E5%9C%A8%E7%BD%91%E8%AE%A1%E7%AE%97/" rel="tag">在网计算</a></li><li class="tag-list-item"><a class="tag-list-link" href="/myworks/tags/%E5%B9%B6%E8%A1%8C%E7%AE%97%E6%B3%95/" rel="tag">并行算法</a></li><li class="tag-list-item"><a class="tag-list-link" href="/myworks/tags/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/" rel="tag">并行计算</a></li><li class="tag-list-item"><a class="tag-list-link" href="/myworks/tags/%E6%93%8D%E4%BD%9C%E7%B3%BB%E7%BB%9F/" rel="tag">操作系统</a></li><li class="tag-list-item"><a class="tag-list-link" href="/myworks/tags/%E6%95%B4%E6%B4%BB/" rel="tag">整活</a></li><li class="tag-list-item"><a class="tag-list-link" href="/myworks/tags/%E7%94%9F%E6%B4%BB/" rel="tag">生活</a></li><li class="tag-list-item"><a class="tag-list-link" href="/myworks/tags/%E7%A1%AC%E6%A0%B8%E7%8B%A0%E4%BA%BA/" rel="tag">硬核狠人</a></li><li class="tag-list-item"><a class="tag-list-link" href="/myworks/tags/%E9%9B%86%E5%90%88%E9%80%9A%E4%BF%A1/" rel="tag">集合通信</a></li><li class="tag-list-item"><a class="tag-list-link" href="/myworks/tags/%E9%AB%98%E6%80%A7%E8%83%BD%E8%AE%A1%E7%AE%97/" rel="tag">高性能计算</a></li></ul>
    </div>
  </div>


  
    
  <div class="widget-wrap">
    <h3 class="widget-title">Tag Cloud</h3>
    <div class="widget tagcloud">
      <a href="/myworks/tags/English-Vocabulary/" style="font-size: 10px;">English Vocabulary</a> <a href="/myworks/tags/English-Writing/" style="font-size: 10px;">English Writing</a> <a href="/myworks/tags/MPI/" style="font-size: 17.5px;">MPI</a> <a href="/myworks/tags/Topology-Aware/" style="font-size: 10px;">Topology-Aware</a> <a href="/myworks/tags/%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D1%8F/" style="font-size: 12.5px;">Россия</a> <a href="/myworks/tags/%E5%9C%A8%E7%BD%91%E8%AE%A1%E7%AE%97/" style="font-size: 12.5px;">在网计算</a> <a href="/myworks/tags/%E5%B9%B6%E8%A1%8C%E7%AE%97%E6%B3%95/" style="font-size: 10px;">并行算法</a> <a href="/myworks/tags/%E5%B9%B6%E8%A1%8C%E8%AE%A1%E7%AE%97/" style="font-size: 20px;">并行计算</a> <a href="/myworks/tags/%E6%93%8D%E4%BD%9C%E7%B3%BB%E7%BB%9F/" style="font-size: 15px;">操作系统</a> <a href="/myworks/tags/%E6%95%B4%E6%B4%BB/" style="font-size: 10px;">整活</a> <a href="/myworks/tags/%E7%94%9F%E6%B4%BB/" style="font-size: 10px;">生活</a> <a href="/myworks/tags/%E7%A1%AC%E6%A0%B8%E7%8B%A0%E4%BA%BA/" style="font-size: 10px;">硬核狠人</a> <a href="/myworks/tags/%E9%9B%86%E5%90%88%E9%80%9A%E4%BF%A1/" style="font-size: 15px;">集合通信</a> <a href="/myworks/tags/%E9%AB%98%E6%80%A7%E8%83%BD%E8%AE%A1%E7%AE%97/" style="font-size: 12.5px;">高性能计算</a>
    </div>
  </div>

  
    
  <div class="widget-wrap">
    <h3 class="widget-title">Archives</h3>
    <div class="widget">
      <ul class="archive-list"><li class="archive-list-item"><a class="archive-list-link" href="/myworks/archives/2022/05/">May 2022</a></li><li class="archive-list-item"><a class="archive-list-link" href="/myworks/archives/2022/04/">April 2022</a></li><li class="archive-list-item"><a class="archive-list-link" href="/myworks/archives/2022/03/">March 2022</a></li></ul>
    </div>
  </div>


  
    
  <div class="widget-wrap">
    <h3 class="widget-title">Recent Posts</h3>
    <div class="widget">
      <ul>
        
          <li>
            <a href="/myworks/2022/05/24/Russian_Grammar/">Русская Граммаатика(俄语语法)</a>
          </li>
        
          <li>
            <a href="/myworks/2022/05/18/OS_FileSystem/">文件系统概念和Ext2文件系统</a>
          </li>
        
          <li>
            <a href="/myworks/2022/05/14/English_Sentences/">English Sentences</a>
          </li>
        
          <li>
            <a href="/myworks/2022/05/11/English_Vocabulary/">English Vocabulary</a>
          </li>
        
          <li>
            <a href="/myworks/2022/05/11/Russian_Vocabulary/">Русские Слова(俄语单词)</a>
          </li>
        
      </ul>
    </div>
  </div>

  
</aside>
        
      </div>
      <footer id="footer">
  
  <div class="outer">
    <div id="footer-info" class="inner">
      
      &copy; 2022 John Doe<br>
      Powered by <a href="https://hexo.io/" target="_blank">Hexo</a>
    </div>
  </div>
</footer>

    </div>
    <nav id="mobile-nav">
  
    <a href="/myworks/" class="mobile-nav-link">Home</a>
  
    <a href="/myworks/archives" class="mobile-nav-link">Archives</a>
  
</nav>
    


<script src="/myworks/js/jquery-3.4.1.min.js"></script>



  
<script src="/myworks/fancybox/jquery.fancybox.min.js"></script>




<script src="/myworks/js/script.js"></script>





  </div>
</body>
</html>